In this assignment (CA1, Deep Learning) I built my own model from scratch and also explored and implemented several existing architectures: ResNet, DenseNet, ResNeXt and CoAtNet.
(Part A: ResNet, DenseNet) | (Part B: ResNeXt, CoAtNet)
Fashion MNIST is a dataset containing greyscaled images of clothing from the fashion company Zalando. It is a dataset developed to replace the simpler MNIST handwritten dataset.
The goal is to build, analyze and evaluate multiple models using the Fashion MNIST dataset and eventually settle on a model that generalizes well to new data without overfitting.
All code is original unless attributed.
!nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-d8f6553d-27a8-db79-b09d-f7d2dc2f7e37)
GPU 1: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-fc296e80-ac09-eb0a-bbff-271651f504c9)
!pip install matplotlib
!pip install seaborn
!pip install torch-summary
!pip install -U scikit-learn
# !pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
import torch, re, time, gc, itertools, random, os ,sys
from IPython.display import clear_output
import torch.optim as optim
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as cp
from torch.utils.data import Dataset, TensorDataset, DataLoader
from torchvision import datasets
from torchvision import utils
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from copy import copy
from collections import OrderedDict
from typing import Any, Callable, List, Optional, Type, Union, Tuple
from sklearn.metrics import classification_report, confusion_matrix
from torchsummary import summary
# from sklearn.preprocessing import OneHotEncoder
sns.set(style="ticks")
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
To make sure results are reproducible.
np.random.seed(2007476)
torch.manual_seed(2007476)
<torch._C.Generator at 0x7f60fd37a330>
# from google.colab import drive
# drive.mount('/content/drive')
# %cd /content/drive/My\ Drive/Skool/2022sem2/DELE/
Sources to get dataset (official): Kaggle CSV | PyTorch Raw Data Documentation
Importing and downloading the raw data from PyTorch's dataset API, then turning it into NumPy arrays so I can perform EDA on it.
training_data = datasets.FashionMNIST(
root="data",
train=True,
download=True,
transform=transforms.ToTensor()
)
test_data = datasets.FashionMNIST(
root="data",
train=False,
download=True,
transform=transforms.ToTensor()
)
class_labels = training_data.classes
train_loader = DataLoader(training_data, batch_size=len(training_data))
test_loader = DataLoader(test_data, batch_size=len(test_data))
train_input = torch.Tensor(next(iter(train_loader))[0].numpy())
test_input = torch.Tensor(next(iter(test_loader))[0].numpy())
train_label = torch.Tensor(next(iter(train_loader))[1].numpy())
test_label = torch.Tensor(next(iter(test_loader))[1].numpy())
del train_loader, test_loader, training_data, test_data
This is a relatively large batch size. A lower batch size (around 32-128) is usually better, as it offers a stronger regularization effect and lower generalization error. However, to speed up training I have chosen a relatively large batch_size; I will lower it during model improvement/hyperparameter tuning. Stronger data augmentation has been used as well.
BATCH_SIZE = 1024
Note that we are exploring the raw dataset of FashionMNIST i.e no data augmentation is done yet.
Training set of 60,000 examples.
Test set of 10,000 examples.
Total of 70,000 examples.
Labels & Description
| Label | Description |
|---|---|
| 0 | T-shirt/top |
| 1 | Trouser |
| 2 | Pullover |
| 3 | Dress |
| 4 | Coat |
| 5 | Sandal |
| 6 | Shirt |
| 7 | Sneaker |
| 8 | Bag |
| 9 | Ankle boot |
Side note: Training set will be split into 50,000 and 10,000 examples for training and validation set respectively to prevent overfitting on the test set.
labels, counts = np.unique(train_label, return_counts=True)
for label, count in zip(labels, counts):
    print(f"{class_labels[int(label)]}: {count}")
fig, ax = plt.subplots()
fig.set_size_inches(8, 4)
sns.barplot(x=class_labels, y=counts, ax=ax)
plt.grid()
plt.tight_layout()
T-shirt/top: 6000
Trouser: 6000
Pullover: 6000
Dress: 6000
Coat: 6000
Sandal: 6000
Shirt: 6000
Sneaker: 6000
Bag: 6000
Ankle boot: 6000
Off the bat, I can tell that there is the same number of datapoints for each class: the classes are uniformly distributed.
What does each class of image look like?
train_input_np = train_input.numpy().reshape(60000,28,28)
fig, ax = plt.subplots(2, 5, figsize=(10, 5), tight_layout=True)
ax = ax.ravel()
for i in range(10):
    images = train_input_np[np.squeeze(train_label == i)][0]
    label = class_labels[i]
    subplot = ax[i]
    subplot.axis("off")
    subplot.imshow(images, cmap="gray")
    subplot.set_title(f"Label: {label}")
A larger set of images from different class
fig, ax = plt.subplots(10, 10, figsize=(20, 20))
for i in range(10):
    images = train_input_np[np.squeeze(train_label == i)]
    random_idxs = np.random.choice(images.shape[0], 10, replace=False)
    images = images[random_idxs]
    label = class_labels[i]
    for j in range(10):
        subplot = ax[i, j]
        subplot.axis("off")
        subplot.imshow(images[j], cmap="gray")
        subplot.set_title(f"Label: {label}")
fig.show()
Every class has distinguishing features, even though some classes look alike. For example, 'Coat', 'Pullover', 'Shirt' and 'T-shirt/top' can look very similar even to the human eye. Small differences can still be seen: 'Coat' and 'Shirt' look similar, but 'Shirt' generally has shorter sleeves and a duller look. I suspect these upper-body clothing classes will be harder for the model to differentiate than the other classes. The images in class 'Bag' differ quite a lot from one another: the general shape is similar, but size and design vary widely. Some 'Ankle boot' images look rather similar to the general shape of 'Sneaker', and vice versa.
What does the 'average' image of each class look like?
fig, ax = plt.subplots(2, 5, figsize=(32, 10))
for idx, subplot in enumerate(ax.ravel()):
    avg_image = np.mean(train_input_np[np.squeeze(train_label == idx)], axis=0)
    subplot.imshow(avg_image, cmap="gray")
    subplot.set_title(f"{class_labels[idx]}")
    subplot.axis("off")
What does the 'average' image of all classes combined look like?
plt.imshow(np.mean(train_input_np, axis=0), cmap="gray")
plt.show()
The 'average' image of each class shows us the general shape of each class. As expected, the upper-body clothing classes have pretty similar shapes; for example, 'Pullover' and 'Coat' look alike, except that 'Pullover' seems to have broader shoulders.
The 'average' image of all classes combined is brightest at the center and fades outward with a gradient, in the shape of a blob or potato. This is expected since all the images are centered.
What does the distribution of average pixel values look like?
# To calculate the average pixel distribution from each class
class_frames = []
for idx in range(len(train_label.unique())):
    avg_image = np.mean(train_input_np[np.squeeze(train_label == idx)], axis=0)
    # One row per pixel of the class-average image, tagged with its class name
    class_frames.append(pd.DataFrame({'avg_values': avg_image.flatten(), 'class': class_labels[idx]}))
avg_images_df = pd.concat(class_frames, ignore_index=True)
# To display the average pixel distribution of all the classes
sns.displot(avg_images_df,x='avg_values',hue='class',height=9,aspect=0.75)
plt.grid()
plt.tight_layout()
Most average pixel values are close to 0 (dark), which is not surprising since black pixels make up the majority of the images seen previously. There is also a roughly Gaussian bump between average pixel values of about 0.4 and 0.8, peaking around 0.65. 'Sneaker' has the most dark values, which tells us it generally has the lowest pixel values. From this graph we can also tell that 'Sandal' has lower average pixel values than 'T-shirt/top'. Beyond these few observations there are not many useful insights, which shows that comparing displayed images is easier; nonetheless, the plot is still decently useful for comparing pixel values across classes.
1) To start, the simplest way I thought of would be to compare the MSE (mean squared error) between an image and the average image of its class. For example, if a sneaker is very different from the average sneaker, it would be an outlier. This might work here; however, if the data had 3 channels and were much more complex, it might not work...
2) There are many useful tools, such as PCA (Principal Component Analysis), for finding outliers. These unsupervised learning techniques group data with similar-looking "components/features" together, and the data furthest away from their group would be outliers.
3) Since we are on the topic of deep learning, autoencoders would be perfect. They are an improvement over something like PCA, which is limited to linear transformations. In contrast, autoencoders can perform non-linear transformations thanks to their non-linear activation functions and multi-layer neural network. It is more efficient to train several layers with an autoencoder, rather than training one huge transformation with PCA.
There are 2 ways I can use an autoencoder to find anomalies. The first is to use a trained encoder to reduce the dimensionality; datapoints furthest from their class group average would be outliers. The second is to compare the loss (MSE in this case) of the final reconstructed image. This works because an autoencoder is trained to minimize reconstruction error, so images unlike most of the training data are reconstructed poorly. Anomalies are detected by checking the magnitude of the reconstruction loss (MSE).
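As a rough illustration of option 1 above (distance to the class-average image), here is a minimal NumPy sketch. The function name `mse_to_class_mean` and the toy data are hypothetical and are not part of the notebook; the notebook itself goes with the autoencoder approach.

```python
import numpy as np

def mse_to_class_mean(images, labels):
    """Return, for every image, the MSE between it and the mean image of its class."""
    images = np.asarray(images, dtype=np.float64)
    labels = np.asarray(labels)
    scores = np.empty(len(images))
    for cls in np.unique(labels):
        mask = labels == cls
        class_mean = images[mask].mean(axis=0)
        diff = images[mask] - class_mean
        # Average the squared difference over all pixels, leaving one score per image
        scores[mask] = (diff ** 2).reshape(mask.sum(), -1).mean(axis=1)
    return scores

# Toy data (hypothetical): two classes of 4x4 "images", with one planted outlier
rng = np.random.default_rng(0)
toy_images = np.concatenate([
    np.full((5, 4, 4), 0.2) + rng.normal(0, 0.01, (5, 4, 4)),
    np.full((5, 4, 4), 0.8) + rng.normal(0, 0.01, (5, 4, 4)),
])
toy_labels = np.array([0] * 5 + [1] * 5)
toy_images[3] = 1.0  # a class-0 image that looks nothing like its class

toy_scores = mse_to_class_mean(toy_images, toy_labels)
most_unusual = int(np.argmax(toy_scores))
print(most_unusual)
```

The image with the largest score is the candidate outlier. On real data the scores would be ranked per class, in the same way the autoencoder losses are ranked later.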

Preparing dataloaders for the autoencoder
train_dataloader_autoencode = DataLoader(TensorDataset(train_input.type('torch.FloatTensor'),train_label.type('torch.FloatTensor')),batch_size=1024,shuffle=True)
# Single-image batches so the reconstruction MSE of each image in a class can be compared
def single_loader(label):
    return DataLoader(TensorDataset(train_input.type('torch.FloatTensor')[train_label==label],train_label.type('torch.FloatTensor')[train_label==label]),batch_size=1,shuffle=False)
Initialising simple convolutional autoencoder, loss function, optimizer etc.
# This autoencoder is original
class SimpleAutoEncoder(nn.Module):
    def __init__(self):
        super(SimpleAutoEncoder, self).__init__()
        # encoder
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.Conv2d(64, 32, 3), nn.BatchNorm2d(32), nn.ReLU(True),
            nn.Conv2d(32, 16, 3), nn.BatchNorm2d(16), nn.ReLU(True)
        )
        # bottleneck
        self.bottleneck = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7744,256), nn.BatchNorm1d(256), nn.ReLU(True),
            nn.Linear(256,7744), nn.BatchNorm1d(7744), nn.ReLU(True),
            nn.Unflatten(1,(16,22,22)) # 7744 = 16 * 22 * 22 (calculation done manually)
        )
        # decoder
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(16, 32, 3), nn.BatchNorm2d(32), nn.ReLU(True),
            nn.ConvTranspose2d(32, 64, 3), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 1, 3), nn.ReLU(True)
        )
    def forward(self, x):
        x = self.encoder(x)
        x = self.bottleneck(x)
        x = self.decoder(x)
        return x
model = nn.DataParallel(SimpleAutoEncoder())
model = model.to(device)
print(model)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr= 0.00025)
def train_ae(net, trainloader, NUM_EPOCHS, displayEvery=5):
    # train_loss = []
    for epoch in range(NUM_EPOCHS):
        running_loss = 0.0
        for data in trainloader:
            img, _ = data # no need for the labels
            img = img.to(device)
            optimizer.zero_grad()
            outputs = net(img)
            loss = criterion(outputs, img)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        loss = running_loss / len(trainloader)
        # train_loss.append(loss)
        if (epoch+1)%displayEvery==0:
            print('- Epoch {} of {}, Train Loss: {:.3f}'.format(epoch+1, NUM_EPOCHS, loss))
def get_outliers(net, trainloader):
    net.eval()
    running_loss = np.array([])
    # Inference only: no gradients are needed to measure the reconstruction error
    with torch.no_grad():
        for data in trainloader:
            img, _ = data # no need for the labels
            img = img.to(device)
            outputs = net(img)
            loss = criterion(outputs, img)
            running_loss = np.append(running_loss,np.array(loss.item()))
    highest6 = np.argsort(running_loss)[::-1][:6] # indices of the 6 highest reconstruction losses
    return highest6
DataParallel(
(module): SimpleAutoEncoder(
(encoder): Sequential(
(0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1))
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1))
(4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): Conv2d(32, 16, kernel_size=(3, 3), stride=(1, 1))
(7): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU(inplace=True)
)
(bottleneck): Sequential(
(0): Flatten(start_dim=1, end_dim=-1)
(1): Linear(in_features=7744, out_features=256, bias=True)
(2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): ReLU(inplace=True)
(4): Linear(in_features=256, out_features=7744, bias=True)
(5): BatchNorm1d(7744, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(6): ReLU(inplace=True)
(7): Unflatten(dim=1, unflattened_size=(16, 22, 22))
)
(decoder): Sequential(
(0): ConvTranspose2d(16, 32, kernel_size=(3, 3), stride=(1, 1))
(1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): ConvTranspose2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
(4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): ConvTranspose2d(64, 1, kernel_size=(3, 3), stride=(1, 1))
(7): ReLU(inplace=True)
)
)
)
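The `7744` used by the bottleneck above can be checked with the standard output-size formulas for convolutions and transposed convolutions (ignoring dilation and output padding). This is a small arithmetic sketch; the helper names `conv_out` and `deconv_out` are made up for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of one spatial dimension after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv_out(size, kernel, stride=1, padding=0):
    """Output length of one spatial dimension after a transposed convolution."""
    return (size - 1) * stride - 2 * padding + kernel

side = 28
for _ in range(3):                 # three 3x3 convolutions, stride 1, no padding
    side = conv_out(side, kernel=3)
flat_features = 16 * side * side   # 16 channels leave the last encoder convolution

restored = side
for _ in range(3):                 # the decoder mirrors the encoder
    restored = deconv_out(restored, kernel=3)

print(side, flat_features, restored)
```

Each unpadded 3x3 convolution trims 2 pixels per side length (28, 26, 24, 22), and the three transposed convolutions add them back, so the reconstruction has the same 28x28 size as the input.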
train_ae(model, train_dataloader_autoencode, 50)
# Remove the dataloader from memory
del train_dataloader_autoencode
- Epoch 5 of 50, Train Loss: 0.055
- Epoch 10 of 50, Train Loss: 0.034
- Epoch 15 of 50, Train Loss: 0.023
- Epoch 20 of 50, Train Loss: 0.019
- Epoch 25 of 50, Train Loss: 0.017
- Epoch 30 of 50, Train Loss: 0.015
- Epoch 35 of 50, Train Loss: 0.014
- Epoch 40 of 50, Train Loss: 0.013
- Epoch 45 of 50, Train Loss: 0.013
- Epoch 50 of 50, Train Loss: 0.012
Get indices of the reconstructed images with the highest MSE
outliers = np.array([])
for i in range(10):
    outliers = np.append(outliers,get_outliers(model,single_loader(i)))
outliers = outliers.astype(int)
Display Top 6 outliers from each class
for e in range(10):
    fig, ax = plt.subplots(1, 6, figsize=(12, 10), tight_layout=True)
    ax = ax.ravel()
    for i in range(6):
        images = train_input_np[train_label==e][outliers][e*6:(e+1)*6][i]
        label = class_labels[e]
        subplot = ax[i]
        subplot.axis("off")
        subplot.imshow(images, cmap="gray")
        subplot.set_title(f"Label: {label}")
del train_input_np
del model, outliers
It seems that the outliers in our dataset follow a simple general trend: their design is very 'polka-dotty', stripy or fancy, which in greyscale makes the product look like it has holes or lines in it. Some of the shoes also look like they were photographed from a different angle. These are definitely images that might give our model a hard time.
An improvement to finding outliers would perhaps be to train a separate autoencoder for each class, so that each autoencoder only sees training data of its own class. The outliers displayed might then be more distinct within their own class. However, this would take much more time.
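A cheaper speed-up is to compute the per-image reconstruction error in batches instead of feeding one image at a time through `single_loader`. The sketch below is an assumption about how that could look, not the code that was run in this notebook: `top_k_reconstruction_errors` and `ZeroNet` are hypothetical names, and `ZeroNet` is only a stand-in for a trained autoencoder.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def top_k_reconstruction_errors(net, images, k=6, batch_size=256):
    """Return (indices, errors) of the k images with the highest per-image MSE."""
    net.eval()
    errors = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]  # move to the model's device if needed
        recon = net(batch)
        # Average over channel, height and width, keeping one value per image
        errors.append(((recon - batch) ** 2).mean(dim=(1, 2, 3)))
    errors = torch.cat(errors)
    top = torch.topk(errors, k)
    return top.indices, top.values

class ZeroNet(nn.Module):
    """Stand-in 'autoencoder' that always reconstructs a black image."""
    def forward(self, x):
        return torch.zeros_like(x)

dummy_images = torch.zeros(10, 1, 28, 28)
dummy_images[7] = 1.0  # the only image ZeroNet reconstructs badly

idx, err = top_k_reconstruction_errors(ZeroNet(), dummy_images, k=3, batch_size=4)
print(idx[0].item(), err[0].item())
```

Because the error is reduced over `dim=(1, 2, 3)` rather than over the whole batch, every image keeps its own score and the ranking matches what the one-image-per-batch loop measures.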
Based on the previous analysis, which augmentation method could yield the best result?
Data replication with augmentation
Preparing multiple datasets with different set of augmentation to see which is the best (for tuning model).
Only the training set should be augmented. The only transformation I can apply to the validation/test set is normalization (either min-max or z-score), and the values used for normalization must also come from the training dataset only. This minimizes data leakage, since we should not know any values from the testing data (the difference should be quite negligible, but it is good practice).
The testing data cannot be used for evaluation at any point until the final evaluation. This removes any decision making based on the testing data, and hence avoids overfitting to the testing data.
No information from testing data can be used for training or validation data, and no information from validation data should be used for training data.
Note that our training and test data are untouched.
display(train_input.shape)
display(test_input.shape)
torch.Size([60000, 1, 28, 28])
torch.Size([10000, 1, 28, 28])
I have decided on 50K training, 10K validation and 10K testing data, as I figure this should be enough datapoints for me to evaluate and choose my best model. Furthermore, from a bit of background research, some state-of-the-art models used a similar data split to achieve their results (48K training and 12K validation is a common split too).
Resources: [ Papers With Code (FashionMNIST state-of-the-art models) ]
| Split | Size |
|---|---|
| Training | 50K |
| Validation | 10K |
| Testing | 10k |
The original training data is kept and augmented copies are added to it (data replication) rather than replacing the original images. This retains, and replicates, data that looks as similar as possible to what we expect in the real world (in this case our test data).
traindata_input = train_input[:50000]
valdata_input = train_input[50000:]
# del train_input
# One hot encoded labels
traindata_label = F.one_hot(train_label.type('torch.LongTensor')[:50000],10)
valdata_label = F.one_hot(train_label.type('torch.LongTensor')[50000:],10)
train_label = F.one_hot(train_label.type('torch.LongTensor'),10)
# del train_label
test_label = F.one_hot(test_label.type('torch.LongTensor'),10)
display(traindata_input.shape)
display(valdata_input.shape)
torch.Size([50000, 1, 28, 28])
torch.Size([10000, 1, 28, 28])
How do we know which data augmentation is better? Does data augmentation really improve accuracy?
Before we continue with data augmentation, there are a few training datasets I would like to prepare. These datasets will be evaluated later to find out whether it is better to use augmentation. Some of these augmentations are chosen based on my background research and intuition (sources and explanation below).
Here is what I have settled on:
| Split | Size | Description |
|---|---|---|
| Trainset 1 | 50K | Normalization (min-max) |
| Trainset 2 (Replicated Data) | 50K + 50K | Normalization (min-max), RandomHorizontalFlip(0.8), RandomErasing() |
| Trainset 3 (Replicated Data) | 50K + 50K | Normalization (min-max), RandomHorizontalFlip(0.8), RandomRotation((-5,5)) |
| Trainset 4 (Replicated Data) | 50K + 50K | Normalization (min-max), RandomHorizontalFlip(0.8), RandomErasing(), RandomRotation((-5,5)) |
| Trainset 5 (Replicated Data) | 50K + 50K | Normalization (min-max), RandomErasing(), RandomRotation((-5,5)) |
| Trainset 6 (Replicated Data) | 50K + 50K | Normalization (min-max), RandomHorizontalFlip(0.8), RandomPerspective(0.115,0.4), RandomErasing() |
Why scale data anyway? Many experiments have shown that normalization improves neural network stability and modeling performance. Differences in scale across input variables may increase the difficulty of the problem being modeled. For example, large input values (e.g. a spread of hundreds or thousands of units) can result in a model that learns large weight values. A model with large weight values is often unstable, meaning that it may suffer from poor performance during learning and from sensitivity to input values, resulting in higher generalization error. With scaled data, the multi-dimensional loss surface around a minimum is more 'symmetrical' and less stretched out, which makes optimization easier and hence improves model performance [Andrew Ng, 2017, unofficial source].
Trainset 1: A simple dataset that does not have any data augmentation besides min-max normalization.
Trainset 2: This dataset contains 100K samples, where the first 50K are the original datapoints, normalized (min-max). RandomHorizontalFlip is applied to replicate more data samples while still being slightly different (especially for shoes). RandomErasing is used to add noise to our dataset, which also gives a regularization effect to our model. The researchers who first introduced RandomErasing were able to achieve state-of-the-art performance on FashionMNIST & ImageNet back in 2017 [Zhong, 2017]. However, note that newer augmentation methods have since outperformed random erasing (which is similar to CutOut).
Trainset 3: This dataset has RandomRotation instead of RandomErasing. RandomErasing may be providing too much noise, so it is replaced with a RandomRotation of up to 5 degrees. This extra rotation may introduce a good amount of replication and noise into our dataset and so might improve our model. Model evaluation still has to be done with all these datasets.
Trainset 4: This dataset combines the augmentation configurations of Trainset 2 and 3 (horizontal flip, random erasing and random rotation).
Trainset 5: This dataset has the same augmentation configuration as Trainset 4, but the horizontal flip is removed because most of the shoes point to the left; perhaps the model will produce a better result without horizontal flipping.
Trainset 6: This dataset combines slight random rotation and distortion to create a random perspective, as though the image is viewed from a different angle. A low distortion scale and probability are used, and RandomErasing is used to add noise.
Since this dataset only contains 1 channel (greyscale), I will opt out of using augmentations that alter the color of the image.
min-max scaling:
$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
z-score scaling:
$z = \frac{x - \mu}{\sigma}$
where,
$\mu$ = mean pixel value of our training dataset
$\sigma$ = standard deviation of the pixel values of our training dataset
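To make the two formulas concrete, here is a small NumPy sketch in which the statistics are taken from the training split only and then reused on other splits, in line with the data-leakage rule stated earlier. The helper names (`fit_scaler`, `min_max`, `z_score`) and the toy pixel values are hypothetical.

```python
import numpy as np

def fit_scaler(train):
    """Collect scaling statistics from the training split only."""
    return {"min": train.min(), "max": train.max(),
            "mean": train.mean(), "std": train.std()}

def min_max(x, stats):
    return (x - stats["min"]) / (stats["max"] - stats["min"])

def z_score(x, stats):
    return (x - stats["mean"]) / stats["std"]

toy_train = np.array([0.0, 64.0, 128.0, 255.0])  # hypothetical raw pixel values
toy_val = np.array([32.0, 255.0])

stats = fit_scaler(toy_train)
val_min_max = min_max(toy_val, stats)  # validation data scaled with training min/max
train_z = z_score(toy_train, stats)    # zero mean, unit standard deviation

print(val_min_max, train_z.mean(), train_z.std())
```

Note that `np.std` computes the population standard deviation, whereas `torch.Tensor.std` is unbiased by default, so the two can differ very slightly.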
train_mean = traindata_input.mean()
train_std = traindata_input.std()
print('Min Pixel Value: {} \nMax Pixel Value: {}'.format(traindata_input.min(), traindata_input.max()))
print('Mean Pixel Value {} \nPixel Values Std: {}'.format(train_mean, train_std))
Min Pixel Value: 0.0
Max Pixel Value: 1.0
Mean Pixel Value 0.2854989171028137
Pixel Values Std: 0.3527844548225403
Since the minimum pixel value is 0 and the maximum is 1, the formula for min-max scaling simplifies to
$x' = \frac{x}{1} = x$
...so it seems my dataset was already normalized when I downloaded it from the PyTorch API (transforms.ToTensor() scales pixel values to the range 0 to 1). On raw images, however, I would usually have to divide by 255 (max pixel value - min pixel value).
trainset1 = traindata_input
# These datasets remain unchanged; they are only renamed
valset = valdata_input
testset = test_input
del valdata_input, test_input
display(trainset1.min())
display(trainset1.max())
tensor(0.)
tensor(1.)
def DataAugment(train_set,composeAug,batch_size=1):
    # Apply composeAug to consecutive slices of batch_size images and concatenate the results
    stack = torch.tensor([], dtype=torch.float)
    for i in range(len(train_set)//batch_size):
        stack = torch.cat((stack,composeAug(train_set[i*batch_size:(i+1)*batch_size])))
    return stack
# Initialization of different augmentation methods
hFlip_RE = transforms.Compose([transforms.RandomHorizontalFlip(0.8),
transforms.RandomErasing()])
hFlip_Rotate = transforms.Compose([transforms.RandomHorizontalFlip(0.8),
transforms.Compose([transforms.RandomRotation((-5,5))])])
RE_Rotate = transforms.Compose([transforms.RandomErasing(),
transforms.Compose([transforms.RandomRotation((-5,5))])])
randRotate = transforms.Compose([transforms.RandomRotation((-5,5))])
randPrespec = transforms.Compose([transforms.RandomHorizontalFlip(0.8),transforms.RandomErasing(),transforms.RandomPerspective(0.115,0.4)])
trainset2_rep = DataAugment(trainset1,hFlip_RE,50)
trainset2 = torch.cat((trainset1,trainset2_rep))
trainset2.shape
torch.Size([100000, 1, 28, 28])
trainset3_rep = DataAugment(trainset1,hFlip_Rotate,50)
trainset3 = torch.cat((trainset1,trainset3_rep))
del trainset3_rep
trainset3.shape
torch.Size([100000, 1, 28, 28])
trainset4_rep = DataAugment(trainset2_rep,randRotate,50)
trainset4 = torch.cat((trainset1,trainset4_rep))
trainset4.shape
torch.Size([100000, 1, 28, 28])
trainset5_rep = DataAugment(trainset1,RE_Rotate,50)
trainset5 = torch.cat((trainset1,trainset5_rep))
# Training data with 100K labels
trainset5.shape
torch.Size([100000, 1, 28, 28])
trainset6_rep = DataAugment(trainset1,randPrespec,50)
trainset6 = torch.cat((trainset1,trainset6_rep))
del RE_Rotate, trainset5_rep, trainset2_rep, hFlip_Rotate, randPrespec
# Training data with 100K labels
traindata_label_100k = torch.cat((traindata_label,traindata_label))
trainset6.shape
torch.Size([100000, 1, 28, 28])
trainset4
fig, ax = plt.subplots(10, 10, figsize=(20, 20))
for i in range(10):
    images = trainset4_rep.reshape(-1,28,28)[np.squeeze(traindata_label.argmax(-1) == i)]
    random_idxs = np.random.choice(images.shape[0], 10, replace=False)
    images = images[random_idxs]
    label = class_labels[i]
    for j in range(10):
        subplot = ax[i, j]
        subplot.axis("off")
        subplot.imshow(images[j], cmap="gray")
        subplot.set_title(f"Label: {label}")
fig.show()
trainset6
fig, ax = plt.subplots(10, 10, figsize=(20, 20))
for i in range(10):
    images = trainset6_rep.reshape(-1,28,28)[np.squeeze(traindata_label.argmax(-1) == i)]
    random_idxs = np.random.choice(images.shape[0], 10, replace=False)
    images = images[random_idxs]
    label = class_labels[i]
    for j in range(10):
        subplot = ax[i, j]
        subplot.axis("off")
        subplot.imshow(images[j], cmap="gray")
        subplot.set_title(f"Label: {label}")
fig.show()
I can tell that all the augmentation techniques are working. Random erasing is blanking out random chunks of the image with different sizes, and it sometimes appears to shift part of the image. Horizontal flip is also working, as most of the shoes now point to the right. The rotation is not so obvious, but I can see some of the shoes being slightly more rotated than others, so rotation is also working. Lastly, RandomPerspective is also working, since I can see some rotation and distortion when trainset6 is displayed.
So what does the 'average' replicated image of each class look like now?
fig, ax = plt.subplots(2, 5, figsize=(32, 10))
for idx, subplot in enumerate(ax.ravel()):
    # Pixel values are already in [0, 1], so no division by 255 is needed
    avg_image = np.mean(trainset4_rep.reshape(-1,28,28).numpy()[np.squeeze(traindata_label.argmax(-1) == idx)], axis=0)
    subplot.imshow(avg_image, cmap="gray")
    subplot.set_title(f"{class_labels[idx]}")
    subplot.axis("off")
del trainset4_rep, trainset6_rep
The 'average' images of the augmented data are generally blurrier, which is expected since more noise has been introduced through rotation, random erasing and horizontal flip, as seen with the shoes. Most shoes now point right instead of left, which may be a problem for the model since our test data contain only shoes pointing left. However, having more shoes pointing right may help increase the regularization effect and feature extraction capabilities. Model evaluation on the different augmented datasets is needed.
Goals:
Loss & Accuracy Plotter:
def plot_history(train_loss, train_acc, val_loss, val_acc, epoch_his,title):
    # Accuracy is drawn on the left axis (ax2) and loss on the right axis (ax1)
    fig, (ax2, ax1) = plt.subplots(1, 2, figsize=(16, 6))
    sns.lineplot(x=epoch_his,y=train_loss, ax=ax1, label='Train Loss')
    sns.lineplot(x=epoch_his,y=val_loss, ax=ax1, label='Validation Loss')
    sns.lineplot(x=epoch_his,y=train_acc, ax=ax2, label='Train Accuracy')
    sns.lineplot(x=epoch_his,y=val_acc, ax=ax2, label='Validation Accuracy')
    ax1.title.set_text(f'Lowest val loss @{min(val_loss):.4f} @Epoch {epoch_his[val_loss.index(min(val_loss))]}')
    ax2.title.set_text(f'Highest val acc @{max(val_acc):.4f} @Epoch {epoch_his[val_acc.index(max(val_acc))]}')
    ax1.axvline(epoch_his[val_loss.index(min(val_loss))],color='r',ls='--',alpha=0.75)
    ax2.axvline(epoch_his[val_acc.index(max(val_acc))],color='r',ls='--',alpha=0.75)
    ax1.axhline(min(val_loss),color='r',ls='--',alpha=0.75)
    ax2.axhline(max(val_acc),color='r',ls='--',alpha=0.75)
    fig.suptitle(title)
    plt.show()
Early Stopper Class: Init parameters take in patience (number of epochs of no improvement to validation loss & accuracy before stopping the model)
class EarlyStopping():
    def __init__(self, patience=5):
        self.patience = patience
        self.counter = 0
        self.early_stop = False
        self.lowest_val_loss = 99999999
        self.highest_val_accuracy = 0
    def __call__(self, val_acc, val_loss):
        if val_loss >= self.lowest_val_loss and val_acc <= self.highest_val_accuracy:
            self.counter +=1
        else:
            self.counter = 0
        if val_acc > self.highest_val_accuracy:
            self.highest_val_accuracy = val_acc
        if val_loss < self.lowest_val_loss:
            self.lowest_val_loss = val_loss
        if self.counter >= self.patience:
            self.early_stop = True
        return self.early_stop
    def reset(self):
        self.counter = 0
        self.lowest_val_loss = 99999999
        self.highest_val_accuracy = 0
        self.early_stop = False
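To see how the patience counter behaves, here is a small self-contained walk-through. The class is a condensed copy of the logic above (without `reset`), and the per-epoch validation numbers are made up purely for illustration.

```python
class EarlyStopping:
    """Condensed copy of the stopper above: counts epochs with no new best loss or accuracy."""
    def __init__(self, patience=5):
        self.patience = patience
        self.counter = 0
        self.early_stop = False
        self.lowest_val_loss = float("inf")
        self.highest_val_accuracy = 0

    def __call__(self, val_acc, val_loss):
        if val_loss >= self.lowest_val_loss and val_acc <= self.highest_val_accuracy:
            self.counter += 1      # neither metric improved this epoch
        else:
            self.counter = 0       # at least one metric set a new best
        self.highest_val_accuracy = max(self.highest_val_accuracy, val_acc)
        self.lowest_val_loss = min(self.lowest_val_loss, val_loss)
        if self.counter >= self.patience:
            self.early_stop = True
        return self.early_stop

stopper = EarlyStopping(patience=2)
# Hypothetical (val_acc, val_loss) pairs, one per epoch
history = [(80.0, 0.60), (85.0, 0.50), (84.0, 0.55), (85.0, 0.52), (86.0, 0.49)]

stopped_at = None
for epoch, (acc, loss) in enumerate(history, start=1):
    if stopper(acc, loss):
        stopped_at = epoch
        break

print(stopped_at)
```

Epochs 3 and 4 improve neither the best loss (0.50) nor the best accuracy (85.0), so with a patience of 2 training stops at epoch 4 and the better epoch 5 is never reached. A larger patience trades training time for a lower chance of stopping too early.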
Model History Logger: Simple function to append information on each model's lowest validation loss, highest validation accuracy, lowest train loss and highest train accuracy
def keep_history(df,train_acc,train_loss,val_loss, val_acc, epoch_his,model_desc='NIL',desc='NIL'):
    return pd.concat([df,pd.DataFrame([[min(val_loss),max(val_acc),min(train_loss),max(train_acc),epoch_his[val_acc.index(max(val_acc))],model_desc,desc]],columns=['Lowest Val Loss','Highest Val Acc','Lowest Train Loss','Highest Train Acc','Epoch (Highest Val Acc)', 'Model Description', 'Parameter'])])
hist_df = pd.DataFrame([],columns=['Lowest Val Loss','Highest Val Acc','Lowest Train Loss','Highest Train Acc','Epoch (Highest Val Acc)', 'Model Description', 'Parameter'])
Train & Validate Functions: Trains the model and return a history of train loss & accuracy as well as validation loss & accuracy.
def train(model, loader, optimizer, NUM_EPOCHS, criterion, valloader=None,earlystopper=None,showEpoch=True,scheduler=None,displayEvery=1):
    train_loss_his = []
    train_accuracy_his = []
    val_loss_his = []
    val_accuracy_his = []
    epoch_his = []
    # The stopper is optional; reset it so state from a previous run does not carry over
    if earlystopper is not None:
        earlystopper.reset()
    model.train()
    for epoch in range(NUM_EPOCHS):
        t0 = time.time()
        total = 0
        correct = 0
        running_loss = 0.0
        for data in loader:
            img, label = data
            img = img.to(device)
            label = label.to(device)
            optimizer.zero_grad()
            outputs = model(img)
            loss = criterion(outputs, label)
            loss.backward()
            optimizer.step()
            # Labels are one-hot encoded, so compare the argmax of predictions and labels
            correct += (outputs.argmax(-1) == label.argmax(-1)).sum().item()
            total += label.size(0)
            running_loss += loss.item()
        # Step the scheduler once per epoch; a per-batch scheduler would need this inside the batch loop
        if scheduler is not None:
            scheduler.step()
        accuracy = 100 * correct / total
        loss = running_loss / len(loader)
        train_loss_his.append(loss)
        train_accuracy_his.append(accuracy)
        val_loss, val_accuracy = validate(model, criterion, valloader)
        val_loss_his.append(val_loss)
        val_accuracy_his.append(val_accuracy)
        epoch_his.append(epoch+1)
        if showEpoch and (epoch+1)%displayEvery == 0:
            print(f'- [Epoch {epoch+1}/{NUM_EPOCHS}] | Train Loss: {loss:.3f}| Train Accuracy: {accuracy} | Val Loss: {val_loss:.3f} | Val Accuracy: {val_accuracy} | Est: {(time.time() - t0)*displayEvery:.2f}s')
        # Use the stopper passed in as an argument (the body previously referred to an undefined name, earlyStopper)
        if earlystopper is not None and earlystopper(val_accuracy,val_loss):
            print(f'EarlyStopper triggered at epochs: {epoch+1} \n*No improvement to validation loss and accuracy could be seen for the past {earlystopper.patience} epochs')
            break
    print(f'Highest Val Accuracy: {max(val_accuracy_his)} @ epoch {epoch_his[val_accuracy_his.index(max(val_accuracy_his))]} | Lowest Val Loss: {min(val_loss_his)} @ epoch {epoch_his[val_loss_his.index(min(val_loss_his))]}')
    return train_loss_his, train_accuracy_his, val_loss_his, val_accuracy_his,epoch_his
def validate(model, criterion, loader=None):
    correct = 0
    total = 0
    running_loss = []
    model.eval()
    # Gradients are not needed for evaluation, so skip building the autograd graph
    with torch.no_grad():
        for inputs, labels in loader:
            inputs = inputs.to(device)
            labels = labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            # Labels are one-hot encoded, so argmax recovers the class index
            correct += (outputs.argmax(-1) == labels.argmax(-1)).sum().item()
            running_loss.append(loss.item())
            total += labels.size(0)
    mean_val_accuracy = 100 * correct / total
    mean_val_loss = sum(running_loss) / len(running_loss)
    # Switch back to training mode for the caller
    model.train()
    return mean_val_loss, mean_val_accuracy
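Both loops above count correct predictions by comparing the `argmax` of the logits with the `argmax` of the labels, which implies that the labels are one-hot vectors rather than class indices. The sketch below illustrates this on made-up values. The tensors are hypothetical, and passing floating-point one-hot targets to `nn.CrossEntropyLoss` assumes PyTorch 1.10 or newer.

```python
import torch
import torch.nn as nn

# Hypothetical logits for 3 samples and 3 classes
logits = torch.tensor([[2.0, 0.1, -1.0],
                       [0.2, 0.3, 3.0],
                       [1.5, 0.2, 0.1]])
# One-hot labels (float), the same layout the loops above expect
labels = torch.tensor([[1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0],
                       [0.0, 1.0, 0.0]])

# argmax over the last dimension turns both tensors into class indices
correct = (logits.argmax(-1) == labels.argmax(-1)).sum().item()
accuracy = 100 * correct / labels.size(0)

# With float targets, CrossEntropyLoss treats each row as class probabilities
loss = nn.CrossEntropyLoss()(logits, labels)
print(correct, accuracy, loss.item())
```

Here the first two samples are classified correctly and the third is not, so `correct` is 2 out of 3.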
The goal is to build a few baseline models, evaluate them, and carry the most promising one forward to hyperparameter tuning.
Layers inspired by the VGG16 architecture [VGG16 Architecture]
VGG16 is a relatively simple architecture, and the three models below are inspired by it. However, the original VGGNet lacks batch normalization, which has been shown to speed up and stabilize training and which also provides a slight regularization effect (this effect is weaker when the batch size is larger, because the statistics of a larger batch are less noisy) [Sergey Ioffe, 2015]. Hence, batch normalization will be used.
Each feature-extraction block follows a generic pattern:
(Conv2d 3x3 -> BatchNorm2d -> ReLU) x N -> MaxPool2d
Classification (fully connected layers):
Dropout(0.5) -> Linear -> BatchNorm1d -> ReLU
def simpleBlock(inp_channel, layers, maxpoolStride1=False):
    block = nn.Sequential()
    # Adds (layers - 1) Conv -> BatchNorm -> ReLU units that keep the channel count
    for i in range(layers-1):
        block.add_module(f'conv_layer_{i+1}', nn.Conv2d(inp_channel, inp_channel, kernel_size=3, stride=1, padding=1))
        block.add_module(f'batch_norm_{i+1}', nn.BatchNorm2d(inp_channel))
        block.add_module(f'ReLU_{i+1}', nn.ReLU(inplace=True))
    # Pool once at the end of a non-empty block; stride 1 shrinks each side by only one pixel
    if layers-1 > 0:
        block.add_module('max_pool2x2', nn.MaxPool2d(2, 1 if maxpoolStride1 else 2))
    return block
class SimpleNet1(nn.Module):
def __init__(self,layer_config,inp,out,growthRate=2,fc_neurons=256):
super(SimpleNet1, self).__init__()
# Implement the sequential module for feature extraction
self.start = nn.Sequential(nn.Conv2d(in_channels=inp, out_channels=out, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(out), nn.ReLU(inplace=True))
#Expand
inp = out
out = out * growthRate
self.block1 = simpleBlock(inp,layer_config[0])
self.block2 = nn.Sequential(nn.Conv2d(in_channels=inp, out_channels=out, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(out), nn.ReLU(inplace=True))
self.block2_add = simpleBlock(out, layer_config[1])
self.flatten = nn.Flatten()
self.classifier = nn.Sequential(
nn.Dropout(p=0.5),
nn.Linear(out * 7 * 7, fc_neurons), nn.BatchNorm1d(fc_neurons), nn.ReLU(inplace=True), nn.Dropout(p=0.5),
nn.Linear(fc_neurons, fc_neurons), nn.BatchNorm1d(fc_neurons), nn.ReLU(inplace=True),
nn.Linear(fc_neurons, 10)
)
# No softmax layer here: nn.CrossEntropyLoss applies log-softmax internally
def forward(self, x):
# Features extraction
x = self.start(x)
x = self.block1(x)
x = self.block2(x)
x = self.block2_add(x)
x = self.flatten(x)
x = self.classifier(x)
return x
class SimpleNet2(nn.Module):
def __init__(self,layer_config,inp,out,growthRate=2,fc_neurons=512):
super(SimpleNet2, self).__init__()
# Implement the sequential module for feature extraction
self.start = nn.Sequential(nn.Conv2d(in_channels=inp, out_channels=out, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(out), nn.ReLU(inplace=True))
inp = out
out = out * growthRate
self.block1 = simpleBlock(inp,layer_config[0],True)
self.block2 = nn.Sequential(nn.Conv2d(in_channels=inp, out_channels=out, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(out), nn.ReLU(inplace=True))
self.block2_add = simpleBlock(out, layer_config[1])
#Expand
inp = out
out = out * growthRate
self.block3 = nn.Sequential(nn.Conv2d(in_channels=inp, out_channels=out, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(out), nn.ReLU(inplace=True))
self.block3_add = simpleBlock(out, layer_config[2])
self.flatten = nn.Flatten()
self.classifier = nn.Sequential(
nn.Dropout(p=0.5),
nn.Linear(out * 6 * 6, fc_neurons), nn.BatchNorm1d(fc_neurons), nn.ReLU(inplace=True), nn.Dropout(p=0.5),
nn.Linear(fc_neurons, fc_neurons), nn.BatchNorm1d(fc_neurons), nn.ReLU(inplace=True),
nn.Linear(fc_neurons, 10)
)
# No softmax layer here: nn.CrossEntropyLoss applies log-softmax internally
def forward(self, x):
# Features extraction
x = self.start(x)
x = self.block1(x)
x = self.block2(x)
x = self.block2_add(x)
x = self.block3(x)
x = self.block3_add(x)
x = self.flatten(x)
x = self.classifier(x)
return x
class SimpleNet3(nn.Module):
def __init__(self,layer_config,inp,out,growthRate=2,fc_neurons=1024):
super(SimpleNet3, self).__init__()
# Implement the sequential module for feature extraction
self.start = nn.Sequential(nn.Conv2d(in_channels=inp, out_channels=out, kernel_size=5, stride=1, padding=1), nn.BatchNorm2d(out), nn.ReLU(inplace=True))
inp = out
out = out * growthRate
self.block1 = simpleBlock(inp,layer_config[0])
self.block2 = nn.Sequential(nn.Conv2d(in_channels=inp, out_channels=out, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(out), nn.ReLU(inplace=True))
self.block2_add = simpleBlock(out, layer_config[1])
#Expand
inp = out
out = out * growthRate
self.block3 = nn.Sequential(nn.Conv2d(in_channels=inp, out_channels=out, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(out), nn.ReLU(inplace=True))
self.block3_add = simpleBlock(out, layer_config[2])
#Expand
inp = out
out = out * growthRate
self.block4 = nn.Sequential(nn.Conv2d(in_channels=inp, out_channels=out, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(out), nn.ReLU(inplace=True))
self.block4_add = simpleBlock(out, layer_config[3], True)
self.flatten = nn.Flatten()
self.classifier = nn.Sequential(
nn.Dropout(p=0.5),
nn.Linear(out *2 * 2, fc_neurons), nn.BatchNorm1d(fc_neurons), nn.ReLU(inplace=True), nn.Dropout(p=0.5),
nn.Linear(fc_neurons, fc_neurons), nn.BatchNorm1d(fc_neurons), nn.ReLU(inplace=True),
nn.Linear(fc_neurons, 10)
)
# No softmax layer here: nn.CrossEntropyLoss applies log-softmax internally
def forward(self, x):
# Features extraction
x = self.start(x)
x = self.block1(x)
x = self.block2(x)
x = self.block2_add(x)
x = self.block3(x)
x = self.block3_add(x)
x = self.block4(x)
x = self.block4_add(x)
x = self.flatten(x)
x = self.classifier(x)
return x
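The classifiers of the three networks above expect flattened inputs of `out * 7 * 7`, `out * 6 * 6` and `out * 2 * 2`. As a quick check of where those numbers come from, the sketch below traces the spatial side length of a 28x28 input through each network using the standard output-size formula. The helper names are hypothetical, and it assumes every `layer_config` entry is at least 2 (so each block ends in its pooling layer).

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Side length after a square Conv2d/MaxPool2d (floor rounding, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

def simplenet_side(first_kernel, pool_strides, size=28):
    # First conv of the network uses padding=1; every later conv is 3x3 with
    # padding=1 and therefore leaves the side length unchanged.
    size = conv_out(size, first_kernel, padding=1)
    for stride in pool_strides:
        size = conv_out(size, 2, stride=stride)  # MaxPool2d(2, stride)
    return size

print(simplenet_side(3, [2, 2]))        # SimpleNet1 -> 7
print(simplenet_side(3, [1, 2, 2]))     # SimpleNet2 -> 6
print(simplenet_side(5, [2, 2, 2, 1]))  # SimpleNet3 -> 2
```

The stride-1 pooling layers in SimpleNet2 and SimpleNet3 only remove one pixel per side, which is why those networks can stack more blocks before the feature map becomes too small.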
ResNet, also known as a residual network, introduced a residual learning framework in 2015 to ease the training of networks that are substantially deeper than those used previously. With up to 152 layers it was 8x deeper than VGG nets, which were considered one of the state-of-the-art architectures at that time. ResNet achieved state-of-the-art performance, winning 1st place in ILSVRC 2015 with a top-5 error rate of 3.57% on ImageNet [He et al., 2015].
Skip connections (shortcuts) let the input of a residual block bypass its convolutional layers and be added to the block's output, so the layers only need to learn a 'residual' on top of their input. Because the shortcut gives gradients a direct path backwards through the network, this mitigates the vanishing gradient problem.
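To make the skip connection concrete, here is a minimal sketch of a residual block. `TinyResidual` is a hypothetical, simplified block written only for illustration; it is not the `BasicBlock` or `Bottleneck` used later, and it assumes the input and output have the same shape so that no downsampling of the identity is needed.

```python
import torch
import torch.nn as nn

class TinyResidual(nn.Module):
    """Hypothetical residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        # F(x): two 3x3 convolutions that preserve the channel count and spatial size
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # The shortcut adds the unchanged input to the output of the conv layers
        return torch.relu(self.body(x) + x)

x = torch.randn(2, 8, 14, 14)
y = TinyResidual(8)(x)
print(y.shape)
```

Because the input is added back unchanged, the output keeps the input's shape, and the gradient with respect to `x` always contains the identity path in addition to the path through the convolutions.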
Types of modification to see: The original ResNet152 was built for ImageNet, which has about 1.2 million training images and 1,000 classes, whereas our dataset has only 70,000 images and 10 classes. The original models may therefore be too complex for our current dataset, so I will again make custom ResNet models that differ in layers, neurons and complexity.
Code to initialize the ResNet, adapted from the original source code: [ResNet PyTorch - Github]
Edits made to original source code by:
def conv3x3(in_planes: int, out_planes: int, stride: int = 1, groups: int = 1, dilation: int = 1) -> nn.Conv2d:
# 3x3 convolution with padding
return nn.Conv2d(
in_planes,
out_planes,
kernel_size=3,
stride=stride,
padding=dilation,
groups=groups,
bias=False,
dilation=dilation,
)
def conv1x1(in_planes: int, out_planes: int, stride: int = 1) -> nn.Conv2d:
"""1x1 convolution"""
return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)
class BasicBlock(nn.Module):
expansion: int = 1
def __init__(
self,
inplanes: int,
planes: int,
stride: int = 1,
downsample: Optional[nn.Module] = None,
groups: int = 1,
base_width: int = 64,
dilation: int = 1,
norm_layer: Optional[Callable[..., nn.Module]] = None,
) -> None:
super().__init__()
if norm_layer is None:
norm_layer = nn.BatchNorm2d
if groups != 1 or base_width != 64:
raise ValueError("BasicBlock only supports groups=1 and base_width=64")
if dilation > 1:
raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
# Both self.conv1 and self.downsample layers downsample the input when stride != 1
self.conv1 = conv3x3(inplanes, planes, stride)
self.bn1 = norm_layer(planes)
self.relu = nn.ReLU(inplace=True)
self.conv2 = conv3x3(planes, planes)
self.bn2 = norm_layer(planes)
self.downsample = downsample
self.stride = stride
def forward(self, x: torch.Tensor) -> torch.Tensor:
identity = x
out = self.conv1(x)
out = self.bn1(out)
out = self.relu(out)
out = self.conv2(out)
out = self.bn2(out)
if self.downsample is not None:
identity = self.downsample(x)
out += identity
out = self.relu(out)
return out
class Bottleneck(nn.Module):
expansion: int = 4
def __init__(
self,
inplanes: int,
planes: int,
stride: int = 1,
downsample: Optional[nn.Module] = None,
groups: int = 1,
base_width: int = 64,
dilation: int = 1,
norm_layer: Optional[Callable[..., nn.Module]] = None,
) -> None:
super().__init__()
if norm_layer is None:
norm_layer = nn.BatchNorm2d
width = int(planes * (base_width / 64.0)) * groups
# Both self.conv2 and self.downsample layers downsample the input when stride != 1
self.conv1 = conv1x1(inplanes, width)
self.bn1 = norm_layer(width)
self.conv2 = conv3x3(width, width, stride, groups, dilation)
self.bn2 = norm_layer(width)
self.conv3 = conv1x1(width, planes * self.expansion)
self.bn3 = norm_layer(planes * self.expansion)
self.relu = nn.ReLU(inplace=True)
self.downsample = downsample
self.stride = stride
def forward(self, x: torch.Tensor) -> torch.Tensor:
identity = x
out = self.conv1(x)
out = self.bn1(out)
out = self.relu(out)
out = self.conv2(out)
out = self.bn2(out)
out = self.relu(out)
out = self.conv3(out)
out = self.bn3(out)
if self.downsample is not None:
identity = self.downsample(x)
out += identity
out = self.relu(out)
return out
class ResNet(nn.Module):
def __init__(
self,
block: Type[Union[BasicBlock, Bottleneck]],
layers: List[int],
num_classes: int = 10,
zero_init_residual: bool = False,
groups: int = 1,
width_per_group: int = 64,
replace_stride_with_dilation: Optional[List[bool]] = None,
norm_layer: Optional[Callable[..., nn.Module]] = None,
) -> None:
super().__init__()
if norm_layer is None:
norm_layer = nn.BatchNorm2d
self._norm_layer = norm_layer
self.inplanes = 64
self.dilation = 1
if replace_stride_with_dilation is None:
# each element in the tuple indicates if we should replace
# the 2x2 stride with a dilated convolution instead
replace_stride_with_dilation = [False, False, False]
if len(replace_stride_with_dilation) != 3:
raise ValueError(
"replace_stride_with_dilation should be None "
f"or a 3-element tuple, got {replace_stride_with_dilation}"
)
self.groups = groups
self.base_width = width_per_group
self.conv1 = nn.Conv2d(1, self.inplanes, kernel_size=5, stride=2, padding=3, bias=False)
self.bn1 = norm_layer(self.inplanes)
self.relu = nn.ReLU(inplace=True)
self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
self.layer1 = self._make_layer(block, 64, layers[0])
self.layer2 = self._make_layer(block, 128, layers[1], stride=2, dilate=replace_stride_with_dilation[0])
self.layer3 = self._make_layer(block, 256, layers[2], stride=2, dilate=replace_stride_with_dilation[1])
self.layer4 = self._make_layer(block, 512, layers[3], stride=2, dilate=replace_stride_with_dilation[2])
self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
self.fc = nn.Linear(512 * block.expansion, num_classes)
for m in self.modules():
if isinstance(m, nn.Conv2d):
nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
nn.init.constant_(m.weight, 1)
nn.init.constant_(m.bias, 0)
if zero_init_residual:
for m in self.modules():
if isinstance(m, Bottleneck) and m.bn3.weight is not None:
nn.init.constant_(m.bn3.weight, 0) # type: ignore[arg-type]
elif isinstance(m, BasicBlock) and m.bn2.weight is not None:
nn.init.constant_(m.bn2.weight, 0) # type: ignore[arg-type]
def _make_layer(
self,
block: Type[Union[BasicBlock, Bottleneck]],
planes: int,
blocks: int,
stride: int = 1,
dilate: bool = False,
) -> nn.Sequential:
norm_layer = self._norm_layer
downsample = None
previous_dilation = self.dilation
if dilate:
self.dilation *= stride
stride = 1
if stride != 1 or self.inplanes != planes * block.expansion:
downsample = nn.Sequential(
conv1x1(self.inplanes, planes * block.expansion, stride),
norm_layer(planes * block.expansion),
)
layers = []
layers.append(
block(
self.inplanes, planes, stride, downsample, self.groups, self.base_width, previous_dilation, norm_layer
)
)
self.inplanes = planes * block.expansion
for _ in range(1, blocks):
layers.append(
block(
self.inplanes,
planes,
groups=self.groups,
base_width=self.base_width,
dilation=self.dilation,
norm_layer=norm_layer,
)
)
return nn.Sequential(*layers)
def _forward_impl(self, x: torch.Tensor) -> torch.Tensor:
# See note [TorchScript super()]
x = self.conv1(x)
x = self.bn1(x)
x = self.relu(x)
x = self.maxpool(x)
x = self.layer1(x)
x = self.layer2(x)
x = self.layer3(x)
x = self.layer4(x)
x = self.avgpool(x)
x = torch.flatten(x, 1)
x = self.fc(x)
return x
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self._forward_impl(x)
Custom ResNet (3 Blocks): The original ResNet architecture contains 4 blocks (stages of residual layers). However, since we are using a simpler dataset, I will create a ResNet architecture with 3 blocks to reduce its complexity.
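One way to see what removing the fourth block means for a 28x28 input is to trace the spatial side length through the stem and the stages. The sketch below does this with the standard output-size formula, assuming the stem used in this notebook (5x5 conv with stride 2 and padding 3, then a 3x3 max pool with stride 2 and padding 1) and a strided 3x3 conv at the start of every stage after the first. The function names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Side length after a square Conv2d/MaxPool2d (floor rounding, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

def resnet_sides(num_stages, size=28):
    sides = [size]
    size = conv_out(size, 5, stride=2, padding=3)  # conv1 of the stem
    sides.append(size)
    size = conv_out(size, 3, stride=2, padding=1)  # max pool of the stem
    sides.append(size)
    for stage in range(num_stages):
        stride = 1 if stage == 0 else 2            # first stage keeps the resolution
        size = conv_out(size, 3, stride=stride, padding=1)
        sides.append(size)
    return sides

print(resnet_sides(4))  # [28, 15, 8, 8, 4, 2, 1]
print(resnet_sides(3))  # [28, 15, 8, 8, 4, 2]
```

Under these assumptions the 4-block network has already shrunk the feature map to 1x1 before global average pooling, while the 3-block variant still has a 2x2 map to average over.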
class block(nn.Module):
def __init__(
self, in_channels, intermediate_channels, identity_downsample=None, stride=1
):
super(block, self).__init__()
self.expansion = 4
# 2D 1x1 conv layer
self.conv1 = nn.Conv2d(
in_channels, intermediate_channels, kernel_size=1, stride=1, padding=0, bias=False
)
# Batch normalization
self.bn1 = nn.BatchNorm2d(intermediate_channels)
# input = output filters (3x3 conv)
self.conv2 = nn.Conv2d(
intermediate_channels,
intermediate_channels,
kernel_size=3,
stride=stride,
padding=1,
bias=False
)
# Batch normalization
self.bn2 = nn.BatchNorm2d(intermediate_channels)
# This layer will expand the number of filters used (1x1 conv)
self.conv3 = nn.Conv2d(
intermediate_channels,
intermediate_channels * self.expansion,
kernel_size=1,
stride=1,
padding=0,
bias=False
)
self.bn3 = nn.BatchNorm2d(intermediate_channels * self.expansion)
self.relu = nn.ReLU()
# Downsample (this identity is used as a residual when skipping connections)
self.identity_downsample = identity_downsample
self.stride = stride
def forward(self, x):
identity = x.clone()
x = self.conv1(x)
x = self.bn1(x)
x = self.relu(x)
x = self.conv2(x)
x = self.bn2(x)
x = self.relu(x)
x = self.conv3(x)
x = self.bn3(x)
# Project the identity when its shape differs from the block output (first block of a stage)
if self.identity_downsample is not None:
identity = self.identity_downsample(identity)
# Add residual (identity) to connection (if exist)
x += identity
x = self.relu(x)
return x
# ResNet initializing
class ResNet3Blocks(nn.Module):
def __init__(self, block, layers, image_channels = 1, num_classes = 10):
super(ResNet3Blocks, self).__init__()
self.in_channels = 64
self.conv1 = nn.Conv2d(image_channels, 64, kernel_size=5, stride=2, padding=3, bias=False)
self.bn1 = nn.BatchNorm2d(64)
self.relu = nn.ReLU()
self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
# Essentially the 3 main layers/blocks of the network
# See below for _make_layer code and comments
self.layer1 = self._make_layer(
block, layers[0], intermediate_channels=64, stride=1
)
self.layer2 = self._make_layer(
block, layers[1], intermediate_channels=128, stride=2
)
self.layer3 = self._make_layer(
block, layers[2], intermediate_channels=256, stride=2
)
self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
self.fc = nn.Linear(256 * 4, num_classes)
# Make the layer of ResNet
def _make_layer(self, block, num_residual_blocks, intermediate_channels, stride):
identity_downsample = None
layers = []
# Code to change input space for identity so it will be able to be added to the layer
if stride != 1 or self.in_channels != intermediate_channels * 4:
identity_downsample = nn.Sequential(
nn.Conv2d(
self.in_channels,
intermediate_channels * 4,
kernel_size=1,
stride=stride,
bias=False
),
nn.BatchNorm2d(intermediate_channels * 4),
)
layers.append(
block(self.in_channels, intermediate_channels, identity_downsample, stride)
)
# The expansion size is always 4 for ResNet 50,101,152
self.in_channels = intermediate_channels * 4
# Finally adding the rest of the layers
for i in range(num_residual_blocks - 1):
layers.append(block(self.in_channels, intermediate_channels))
# Return all the layers
return nn.Sequential(*layers)
# Forward pass
def forward(self, x):
x = self.conv1(x)
x = self.bn1(x)
x = self.relu(x)
x = self.maxpool(x)
x = self.layer1(x)
x = self.layer2(x)
x = self.layer3(x)
x = self.avgpool(x)
x = x.reshape(x.shape[0], -1)
x = self.fc(x)
return x
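The custom ResNet3Blocks class is called as ResNet3Blocks(block, layers, image_channels, num_classes), where block is the bottleneck block class defined above and layers gives the number of residual blocks in each of the 3 stages.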
DenseNet is similar to ResNet in that earlier feature maps are passed forward as the network gets deeper, but what makes DenseNet different from ResNet is the connectivity pattern. What does this mean?

The idea behind DenseNet is that the input of each layer is the concatenation of the block's input and the feature maps produced by all the previous layers in that block, whereas ResNet adds the shortcut to the output instead of concatenating it.
Its authors reported that this connectivity pattern improves on the ResNet architecture.
Code referenced from https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py
Edits made to original source code by:
Greatly reduced the original source code by:
class _DenseLayer(nn.Module):
def __init__(
self, num_input_features: int, growth_rate: int, bn_size: int, drop_rate: float, memory_efficient: bool = False
) -> None:
super().__init__()
self.norm1 = nn.BatchNorm2d(num_input_features)
self.relu1 = nn.ReLU(inplace=True)
self.conv1 = nn.Conv2d(num_input_features, bn_size * growth_rate, kernel_size=1, stride=1, bias=False)
self.norm2 = nn.BatchNorm2d(bn_size * growth_rate)
self.relu2 = nn.ReLU(inplace=True)
self.conv2 = nn.Conv2d(bn_size * growth_rate, growth_rate, kernel_size=3, stride=1, padding=1, bias=False)
self.drop_rate = float(drop_rate)
self.memory_efficient = memory_efficient
def bn_function(self, inputs: List[torch.Tensor]) -> torch.Tensor:
concated_features = torch.cat(inputs, 1)
bottleneck_output = self.conv1(self.relu1(self.norm1(concated_features))) # noqa: T484
return bottleneck_output
# todo: rewrite when torchscript supports any
def any_requires_grad(self, input: List[torch.Tensor]) -> bool:
for tensor in input:
if tensor.requires_grad:
return True
return False
@torch.jit.unused # noqa: T484
def call_checkpoint_bottleneck(self, input: List[torch.Tensor]) -> torch.Tensor:
def closure(*inputs):
return self.bn_function(inputs)
return cp.checkpoint(closure, *input)
@torch.jit._overload_method # noqa: F811
def forward(self, input: List[torch.Tensor]) -> torch.Tensor: # noqa: F811
pass
@torch.jit._overload_method # noqa: F811
def forward(self, input: torch.Tensor) -> torch.Tensor: # noqa: F811
pass
# torchscript does not yet support *args, so we overload method
# allowing it to take either a List[Tensor] or single Tensor
def forward(self, input: torch.Tensor) -> torch.Tensor: # noqa: F811
if isinstance(input, torch.Tensor):
prev_features = [input]
else:
prev_features = input
if self.memory_efficient and self.any_requires_grad(prev_features):
if torch.jit.is_scripting():
raise Exception("Memory Efficient not supported in JIT")
bottleneck_output = self.call_checkpoint_bottleneck(prev_features)
else:
bottleneck_output = self.bn_function(prev_features)
new_features = self.conv2(self.relu2(self.norm2(bottleneck_output)))
if self.drop_rate > 0:
new_features = F.dropout(new_features, p=self.drop_rate, training=self.training)
return new_features
class _DenseBlock(nn.ModuleDict):
_version = 2
def __init__(
self,
num_layers: int,
num_input_features: int,
bn_size: int,
growth_rate: int,
drop_rate: float,
memory_efficient: bool = False,
) -> None:
super().__init__()
for i in range(num_layers):
layer = _DenseLayer(
num_input_features + i * growth_rate,
growth_rate=growth_rate,
bn_size=bn_size,
drop_rate=drop_rate,
memory_efficient=memory_efficient,
)
self.add_module("denselayer%d" % (i + 1), layer)
def forward(self, init_features: torch.Tensor) -> torch.Tensor:
features = [init_features]
for name, layer in self.items():
new_features = layer(features)
features.append(new_features)
return torch.cat(features, 1)
class _Transition(nn.Sequential):
def __init__(self, num_input_features: int, num_output_features: int) -> None:
super().__init__()
self.norm = nn.BatchNorm2d(num_input_features)
self.relu = nn.ReLU(inplace=True)
self.conv = nn.Conv2d(num_input_features, num_output_features, kernel_size=1, stride=1, bias=False)
self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
class DenseNet(nn.Module):
def __init__(
self,
growth_rate: int = 32,
block_config: Tuple[int, int, int, int] = (6, 12, 24, 16),
num_init_features: int = 64,
bn_size: int = 4,
drop_rate: float = 0,
num_classes: int = 10,
memory_efficient: bool = False,
) -> None:
super().__init__()
# First convolution
self.features = nn.Sequential(
OrderedDict(
[
("conv0", nn.Conv2d(1, num_init_features, kernel_size=5, stride=2, padding=3, bias=False)),
("norm0", nn.BatchNorm2d(num_init_features)),
("relu0", nn.ReLU(inplace=True)),
("pool0", nn.MaxPool2d(kernel_size=3, stride=2, padding=1)),
]
)
)
# Each denseblock
num_features = num_init_features
for i, num_layers in enumerate(block_config):
block = _DenseBlock(
num_layers=num_layers,
num_input_features=num_features,
bn_size=bn_size,
growth_rate=growth_rate,
drop_rate=drop_rate,
memory_efficient=memory_efficient,
)
self.features.add_module("denseblock%d" % (i + 1), block)
num_features = num_features + num_layers * growth_rate
if i != len(block_config) - 1:
trans = _Transition(num_input_features=num_features, num_output_features=num_features // 2)
self.features.add_module("transition%d" % (i + 1), trans)
num_features = num_features // 2
# Final batch norm
self.features.add_module("norm5", nn.BatchNorm2d(num_features))
# Linear layer
self.classifier = nn.Linear(num_features, num_classes)
# Official init from torch repo.
for m in self.modules():
if isinstance(m, nn.Conv2d):
nn.init.kaiming_normal_(m.weight)
elif isinstance(m, nn.BatchNorm2d):
nn.init.constant_(m.weight, 1)
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.Linear):
nn.init.constant_(m.bias, 0)
def forward(self, x: torch.Tensor) -> torch.Tensor:
features = self.features(x)
out = F.relu(features, inplace=True)
out = F.adaptive_avg_pool2d(out, (1, 1))
out = torch.flatten(out, 1)
out = self.classifier(out)
return out
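Parameters for DenseNet code above (based on my understanding of this code):
DenseNet(growth_rate, block_config, num_init_features, bn_size, drop_rate, num_classes, memory_efficient). Note that the third positional argument is num_init_features (the number of channels after the first convolution), not the number of output classes; num_classes defaults to 10.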
DenseNet will be the last architecture I will be exploring in Part A
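Since each dense layer concatenates `growth_rate` new channels and each transition halves the channel count, the width of the network can be worked out by hand. The sketch below repeats the bookkeeping from the `DenseNet` constructor above in plain Python. The function name is hypothetical, and the two example configurations assume `num_init_features=10`, which is how the signature above interprets a third positional argument of 10.

```python
def densenet_channels(growth_rate, block_config, num_init_features):
    """Return the channel count after each dense block and the final classifier input size."""
    num_features = num_init_features
    after_each_block = []
    for i, num_layers in enumerate(block_config):
        # Every dense layer appends growth_rate feature maps to the running concatenation
        num_features += num_layers * growth_rate
        after_each_block.append(num_features)
        # A transition layer follows every block except the last and halves the channels
        if i != len(block_config) - 1:
            num_features //= 2
    return after_each_block, num_features

print(densenet_channels(24, (3, 6, 12, 8), 10))  # ([82, 185, 380, 382], 382)
print(densenet_channels(16, (8, 20, 15), 10))    # ([138, 389, 434], 434)
```

The last number in each result is the input size of the final `nn.Linear` classifier.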
Early stopper patience set to 18 epochs
# init early stopper with patience of 18
earlyStopper = EarlyStopping(18)
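The `EarlyStopping` class itself is not shown in this section. As a rough illustration of how a patience-based stopper of this kind might behave, here is a hypothetical stand-in. `PatienceStopper` is an invented name and its improvement rule is an assumption, not the notebook's implementation. It only mirrors the interface that `train` relies on: a `reset()` method, a call with `(val_accuracy, val_loss)` that returns whether to stop, and a `patience` attribute.

```python
class PatienceStopper:
    """Hypothetical stand-in for an early stopper (not the notebook's EarlyStopping)."""

    def __init__(self, patience):
        self.patience = patience
        self.reset()

    def reset(self):
        # Forget everything from a previous training run
        self.best_acc = float('-inf')
        self.best_loss = float('inf')
        self.stale_epochs = 0

    def __call__(self, val_acc, val_loss):
        # Assumed rule: an epoch counts as an improvement if either metric gets better
        improved = False
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            improved = True
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            improved = True
        self.stale_epochs = 0 if improved else self.stale_epochs + 1
        # Stop once `patience` consecutive epochs brought no improvement
        return self.stale_epochs >= self.patience

stopper = PatienceStopper(patience=2)
history = [(90.0, 0.50), (89.0, 0.60), (88.0, 0.70)]
decisions = [stopper(acc, loss) for acc, loss in history]
print(decisions)  # [False, False, True]
```

With a patience of 18, a rule like this would end training 18 epochs after the last epoch in which validation accuracy or loss improved.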
Loading validation and trainset1 (min-max normalized dataset)
trainloader1 = DataLoader(TensorDataset(trainset1.type('torch.FloatTensor'),traindata_label.type('torch.FloatTensor')), shuffle=True, batch_size=BATCH_SIZE)
valloader = DataLoader(TensorDataset(valset.type('torch.FloatTensor'),valdata_label.type('torch.FloatTensor')), shuffle=False, batch_size=BATCH_SIZE)
OwnBaseline1 = SimpleNet1([2,2],1,32,2,256)
print('OwnBaseline1')
summary(OwnBaseline1,(1,28,28))
OwnBaseline2 = SimpleNet2([2,2,2],1,32,2,512)
print('\nOwnBaseline2')
summary(OwnBaseline2,(1,28,28))
print('\nOwnBaseline3')
OwnBaseline3 = SimpleNet3([2,3,4,4],1,32,2,1024)
summary(OwnBaseline3,(1,28,28))
ResNet4Blocks = ResNet(Bottleneck,[1,1,1,1])
print('\nResNet4Blocks')
summary(ResNet4Blocks,(1,28,28))
ResNet3Blocks = ResNet3Blocks(block,[1,2,1],1,10)
print('\nResNet3Blocks')
summary(ResNet3Blocks,(1,28,28))
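# The third positional argument is num_init_features; num_classes keeps its default of 10
DenseNet4Blocks = DenseNet(24,(3,6,12,8),num_init_features=10)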
print('\nDenseNet4Blocks')
summary(DenseNet4Blocks,(1,28,28))
DenseNet3Blocks = DenseNet(16,(8,20,15),num_init_features=10)
print('\nDenseNet3Blocks')
summary(DenseNet3Blocks,(1,28,28))
From the above summaries, I will give each baseline model an alias and compile the number of parameters each one has.
| Model | Layers (Conv+FC) | Parameters |
|---|---|---|
| OwnBaseline1 | 7 | 937,834 |
| OwnBaseline2 | 9 | 2,917,482 |
| OwnBaseline3 | 16 | 4,801,578 |
| ResNet4Blocks | 14 | 8,028,618 |
| ResNet3Blocks | 14 | 2,258,890 |
| DenseNet4Blocks | 93 | 1,253,276 |
| DenseNet3Blocks | 93 | 1,122,830 |
Right away, I can tell that the DenseNet architectures have more layers than the rest. This is not surprising, as DenseNet scales up mainly by increasing its depth. Something good to explore in future would be EfficientNet's compound scaling.
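The parameter counts in the table come from `summary`, but they can also be checked by hand. As a sanity check, the sketch below recomputes the count for OwnBaseline1, i.e. `SimpleNet1([2,2],1,32,2,256)`, from the standard formulas for convolution, batch-norm and linear layers. The helper names are hypothetical, and batch-norm running statistics are not counted because they are buffers rather than parameters.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    # One k x k kernel per (input channel, output channel) pair, plus optional biases
    return c_in * c_out * k * k + (c_out if bias else 0)

def batchnorm_params(channels):
    # Learnable scale and shift per channel
    return 2 * channels

def linear_params(n_in, n_out):
    return n_in * n_out + n_out

total = 0
total += conv2d_params(1, 32, 3) + batchnorm_params(32)           # start
total += conv2d_params(32, 32, 3) + batchnorm_params(32)          # block1 (one conv unit)
total += conv2d_params(32, 64, 3) + batchnorm_params(64)          # block2
total += conv2d_params(64, 64, 3) + batchnorm_params(64)          # block2_add (one conv unit)
total += linear_params(64 * 7 * 7, 256) + batchnorm_params(256)   # first FC layer
total += linear_params(256, 256) + batchnorm_params(256)          # second FC layer
total += linear_params(256, 10)                                   # output layer
print(f'{total:,}')  # 937,834
```

The first fully connected layer alone accounts for 803,072 of these parameters, which is why the simpler baselines are dominated by their classifier rather than by their convolutions.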
Initializing criterion
criterion = nn.CrossEntropyLoss()
A simple function to reduce repetition of code...
def _train(model, model_desc, parameters_desc, fast=True, showEpochs=True, trainloader=trainloader1, df=hist_df, displayEvery=10):
    # Spread batches across GPUs when more than one is available
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    model = model.to(device)
    # "fast" uses a higher learning rate together with stronger weight decay
    if fast:
        optimizer = optim.AdamW(model.parameters(), lr=0.008, weight_decay=0.1)
    else:
        optimizer = optim.AdamW(model.parameters(), lr=0.004, weight_decay=0.01)
    # Train for at most 80 epochs; the early stopper may end the run sooner
    train_loss, train_acc, val_loss, val_acc, epoch_his = train(model, trainloader, optimizer, 80, criterion, valloader, earlyStopper, showEpochs, displayEvery=displayEvery)
    df = keep_history(df, train_acc, train_loss, val_loss, val_acc, epoch_his, model_desc, parameters_desc)
    return train_loss, train_acc, val_loss, val_acc, epoch_his, df
Showing every epoch for every model would be rather noisy, so I will only display accuracy/loss every 10th epoch
Training of OwnBaselines (VGGNet inspired)
print('OwnBaseline1')
train_loss1, train_acc1, val_loss1, val_acc1, epoch_his1, hist_df = _train(OwnBaseline1,'OwnBaseline1','937,834',False,df=hist_df)
print('\nOwnBaseline2')
train_loss2, train_acc2, val_loss2, val_acc2, epoch_his2, hist_df = _train(OwnBaseline2,'OwnBaseline2','2,917,482', False,df=hist_df)
print('\nOwnBaseline3')
train_loss3, train_acc3, val_loss3, val_acc3, epoch_his3, hist_df = _train(OwnBaseline3,'OwnBaseline3','4,801,578',df=hist_df)
OwnBaseline1
- [Epoch 10/80] | Train Loss: 0.170| Train Accuracy: 93.74 | Val Loss: 0.195 | Val Accuracy: 93.17 | Est: 29.29s
- [Epoch 20/80] | Train Loss: 0.119| Train Accuracy: 95.65 | Val Loss: 0.191 | Val Accuracy: 93.34 | Est: 27.70s
- [Epoch 30/80] | Train Loss: 0.090| Train Accuracy: 96.62 | Val Loss: 0.189 | Val Accuracy: 94.17 | Est: 27.99s
- [Epoch 40/80] | Train Loss: 0.073| Train Accuracy: 97.258 | Val Loss: 0.216 | Val Accuracy: 93.46 | Est: 29.27s
- [Epoch 50/80] | Train Loss: 0.059| Train Accuracy: 97.87 | Val Loss: 0.221 | Val Accuracy: 93.78 | Est: 27.11s
EarlyStopper triggered at epochs: 53
*No improvement to validation loss and accuracy could be seen for the past 18 epochs
Highest Val Accuracy: 94.32 @ epoch 35 | Lowest Val Loss: 0.17969967871904374 @ epoch 16

OwnBaseline2
- [Epoch 10/80] | Train Loss: 0.153| Train Accuracy: 94.346 | Val Loss: 0.204 | Val Accuracy: 92.53 | Est: 44.54s
- [Epoch 20/80] | Train Loss: 0.093| Train Accuracy: 96.532 | Val Loss: 0.219 | Val Accuracy: 92.76 | Est: 43.32s
- [Epoch 30/80] | Train Loss: 0.055| Train Accuracy: 98.042 | Val Loss: 0.196 | Val Accuracy: 94.0 | Est: 43.30s
- [Epoch 40/80] | Train Loss: 0.040| Train Accuracy: 98.534 | Val Loss: 0.218 | Val Accuracy: 94.37 | Est: 43.60s
- [Epoch 50/80] | Train Loss: 0.032| Train Accuracy: 98.822 | Val Loss: 0.237 | Val Accuracy: 94.44 | Est: 44.31s
EarlyStopper triggered at epochs: 53
*No improvement to validation loss and accuracy could be seen for the past 18 epochs
Highest Val Accuracy: 94.68 @ epoch 35 | Lowest Val Loss: 0.17212315350770951 @ epoch 15

OwnBaseline3
- [Epoch 10/80] | Train Loss: 0.294| Train Accuracy: 89.626 | Val Loss: 0.512 | Val Accuracy: 82.74 | Est: 42.14s
- [Epoch 20/80] | Train Loss: 0.187| Train Accuracy: 93.45 | Val Loss: 0.360 | Val Accuracy: 86.78 | Est: 42.00s
- [Epoch 30/80] | Train Loss: 0.128| Train Accuracy: 95.504 | Val Loss: 0.294 | Val Accuracy: 90.0 | Est: 41.50s
- [Epoch 40/80] | Train Loss: 0.095| Train Accuracy: 96.706 | Val Loss: 0.245 | Val Accuracy: 92.57 | Est: 41.73s
- [Epoch 50/80] | Train Loss: 0.077| Train Accuracy: 97.392 | Val Loss: 0.293 | Val Accuracy: 91.9 | Est: 40.94s
EarlyStopper triggered at epochs: 58
*No improvement to validation loss and accuracy could be seen for the past 18 epochs
Highest Val Accuracy: 92.57 @ epoch 40 | Lowest Val Loss: 0.24525991082191467 @ epoch 40
print('ResNet4Blocks')
train_loss4, train_acc4, val_loss4, val_acc4, epoch_his4, hist_df = _train(ResNet4Blocks,'ResNet4Blocks','8,028,618',df=hist_df)
print('\nResNet3Blocks')
train_loss5, train_acc5, val_loss5, val_acc5, epoch_his5, hist_df = _train(ResNet3Blocks,'ResNet3Blocks','2,258,890',False,df=hist_df)
ResNet4Blocks
- [Epoch 10/80] | Train Loss: 0.188| Train Accuracy: 93.0 | Val Loss: 0.323 | Val Accuracy: 89.02 | Est: 42.69s
- [Epoch 20/80] | Train Loss: 0.132| Train Accuracy: 94.948 | Val Loss: 0.293 | Val Accuracy: 90.65 | Est: 42.68s
- [Epoch 30/80] | Train Loss: 0.086| Train Accuracy: 96.754 | Val Loss: 0.343 | Val Accuracy: 89.39 | Est: 43.59s
- [Epoch 40/80] | Train Loss: 0.061| Train Accuracy: 97.724 | Val Loss: 0.367 | Val Accuracy: 90.9 | Est: 42.57s
EarlyStopper triggered at epochs: 40
*No improvement to validation loss and accuracy could be seen for the past 18 epochs
Highest Val Accuracy: 91.33 @ epoch 22 | Lowest Val Loss: 0.24965646266937255 @ epoch 16

ResNet3Blocks
- [Epoch 10/80] | Train Loss: 0.124| Train Accuracy: 95.376 | Val Loss: 0.301 | Val Accuracy: 90.37 | Est: 39.05s
- [Epoch 20/80] | Train Loss: 0.055| Train Accuracy: 97.922 | Val Loss: 0.376 | Val Accuracy: 91.07 | Est: 38.34s
- [Epoch 30/80] | Train Loss: 0.024| Train Accuracy: 99.12 | Val Loss: 0.483 | Val Accuracy: 90.94 | Est: 39.86s
- [Epoch 40/80] | Train Loss: 0.025| Train Accuracy: 99.142 | Val Loss: 0.551 | Val Accuracy: 90.75 | Est: 39.41s
- [Epoch 50/80] | Train Loss: 0.009| Train Accuracy: 99.696 | Val Loss: 0.521 | Val Accuracy: 91.69 | Est: 38.72s
- [Epoch 60/80] | Train Loss: 0.010| Train Accuracy: 99.64 | Val Loss: 0.536 | Val Accuracy: 91.69 | Est: 38.69s
EarlyStopper triggered at epochs: 68
*No improvement to validation loss and accuracy could be seen for the past 18 epochs
Highest Val Accuracy: 91.69 @ epoch 50 | Lowest Val Loss: 0.2535449489951134 @ epoch 7
print('DenseNet4Blocks')
train_loss6, train_acc6, val_loss6, val_acc6, epoch_his6, hist_df = _train(DenseNet4Blocks,'DenseNet4Blocks','1,253,276',False,df=hist_df)
print('DenseNet3Blocks')
train_loss7, train_acc7, val_loss7, val_acc7, epoch_his7, hist_df = _train(DenseNet3Blocks,'DenseNet3Blocks','1,122,830',False,df=hist_df)
DenseNet4Blocks
- [Epoch 10/80] | Train Loss: 0.168| Train Accuracy: 93.736 | Val Loss: 0.316 | Val Accuracy: 88.89 | Est: 88.69s
- [Epoch 20/80] | Train Loss: 0.079| Train Accuracy: 97.014 | Val Loss: 0.400 | Val Accuracy: 88.35 | Est: 88.31s
- [Epoch 30/80] | Train Loss: 0.048| Train Accuracy: 98.188 | Val Loss: 0.486 | Val Accuracy: 89.05 | Est: 89.72s
- [Epoch 40/80] | Train Loss: 0.026| Train Accuracy: 99.024 | Val Loss: 0.536 | Val Accuracy: 89.53 | Est: 87.70s
- [Epoch 50/80] | Train Loss: 0.016| Train Accuracy: 99.456 | Val Loss: 0.623 | Val Accuracy: 89.23 | Est: 88.20s
EarlyStopper triggered at epochs: 54
*No improvement to validation loss and accuracy could be seen for the past 18 epochs
Highest Val Accuracy: 90.11 @ epoch 36 | Lowest Val Loss: 0.3052383542060852 @ epoch 9
DenseNet3Blocks
- [Epoch 10/80] | Train Loss: 0.165| Train Accuracy: 93.874 | Val Loss: 0.296 | Val Accuracy: 89.87 | Est: 122.05s
- [Epoch 20/80] | Train Loss: 0.069| Train Accuracy: 97.556 | Val Loss: 0.414 | Val Accuracy: 89.12 | Est: 123.01s
EarlyStopper triggered at epochs: 28
*No improvement to validation loss and accuracy could be seen for the past 18 epochs
Highest Val Accuracy: 89.87 @ epoch 10 | Lowest Val Loss: 0.2958538427948952 @ epoch 10
def classification_rep(model,criterion,loader,train_loss, train_acc, val_loss, val_acc, epoch_his ,title, plot=True):
    print(title)
    if plot:
        plot_history(train_loss, train_acc, val_loss, val_acc, epoch_his,title)
    y_pred = []
    y_true = []
    model.eval()
    # Inference only: no gradients are needed, which saves memory and time
    with torch.no_grad():
        for inputs, labels in loader:
            inputs = inputs.to(device)
            labels = labels.to(device)
            outputs = model(inputs)
            # argmax over the class dimension turns scores (and the label vectors) into class indices
            y_pred.extend(outputs.argmax(-1).tolist())
            y_true.extend(labels.argmax(-1).tolist())
    model.train()
    print(classification_report(y_true, y_pred,target_names=class_labels))
    s = sns.heatmap(confusion_matrix(y_true, y_pred), cmap='Blues', annot=True, fmt='d',
                    xticklabels=class_labels,
                    yticklabels=class_labels)
    s.set_xlabel('Predicted', fontsize=16)
    s.set_ylabel('Actual', fontsize=16)
    plt.show()
    return
display(hist_df.iloc[0:3])
classification_rep(OwnBaseline1,criterion,valloader,train_loss1, train_acc1, val_loss1, val_acc1, epoch_his1,'OwnBaseline1')
classification_rep(OwnBaseline2,criterion,valloader,train_loss2, train_acc2, val_loss2, val_acc2, epoch_his2,'OwnBaseline2')
classification_rep(OwnBaseline3,criterion,valloader,train_loss3, train_acc3, val_loss3, val_acc3, epoch_his3,'OwnBaseline3')
| | Lowest Val Loss | Highest Val Acc | Lowest Train Loss | Highest Train Acc | Epoch (Highest Val Acc) | Model Description | Parameter |
|---|---|---|---|---|---|---|---|
| 0 | 0.179700 | 94.32 | 0.054980 | 98.004 | 35 | OwnBaseline1 | 937,834 |
| 0 | 0.172123 | 94.68 | 0.028777 | 98.968 | 35 | OwnBaseline2 | 2,917,482 |
| 0 | 0.245260 | 92.57 | 0.053536 | 98.206 | 40 | OwnBaseline3 | 4,801,578 |
OwnBaseline1
precision recall f1-score support
T_shirt/top 0.91 0.86 0.89 1001
Trouser 1.00 0.99 0.99 996
Pullover 0.92 0.91 0.92 1004
Dress 0.95 0.92 0.93 1017
Coat 0.90 0.92 0.91 997
Sandal 0.99 0.99 0.99 971
Shirt 0.80 0.87 0.83 986
Sneaker 0.98 0.97 0.97 1003
Bag 0.98 0.99 0.99 1017
Ankle boot 0.97 0.98 0.98 1008
accuracy 0.94 10000
macro avg 0.94 0.94 0.94 10000
weighted avg 0.94 0.94 0.94 10000
OwnBaseline2
precision recall f1-score support
T_shirt/top 0.89 0.93 0.91 1001
Trouser 1.00 0.99 0.99 996
Pullover 0.92 0.92 0.92 1004
Dress 0.93 0.95 0.94 1017
Coat 0.88 0.92 0.90 997
Sandal 0.99 0.99 0.99 971
Shirt 0.89 0.79 0.84 986
Sneaker 0.96 0.99 0.98 1003
Bag 0.99 0.99 0.99 1017
Ankle boot 0.99 0.96 0.98 1008
accuracy 0.94 10000
macro avg 0.94 0.94 0.94 10000
weighted avg 0.94 0.94 0.94 10000
OwnBaseline3
precision recall f1-score support
T_shirt/top 0.91 0.80 0.85 1001
Trouser 1.00 0.97 0.99 996
Pullover 0.91 0.86 0.88 1004
Dress 0.92 0.91 0.92 1017
Coat 0.84 0.87 0.86 997
Sandal 0.98 0.98 0.98 971
Shirt 0.71 0.85 0.77 986
Sneaker 0.96 0.97 0.97 1003
Bag 0.99 0.97 0.98 1017
Ankle boot 0.97 0.97 0.97 1008
accuracy 0.92 10000
macro avg 0.92 0.92 0.92 10000
weighted avg 0.92 0.92 0.92 10000
Of the three VGGNet-inspired baseline models, OwnBaseline2 performs best in terms of validation accuracy, but shows some overfitting over time. Most models seem to overfit the training data, as seen by their very high training accuracy and very low training loss. OwnBaseline1, which is the simplest model with the lowest parameter count, comes close behind OwnBaseline2 (94.32% vs 94.68% highest validation accuracy).
From the classification report, Shirt is the most commonly misidentified class: it has the lowest F1-score in all 3 OwnBaseline models. The confusion matrix shows that Shirt is most often confused with T_shirt/top, and vice versa. This is not surprising; after all, T_shirt and Shirt are both shirts, and they were rather hard to tell apart even for me during our EDA earlier.
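The Shirt / T_shirt confusion described above can also be read off a confusion matrix programmatically. The following is a minimal sketch rather than the notebook's own code: the helper name `top_confusions` and the small 3-class matrix are made up for illustration, and in practice the matrix would come from `confusion_matrix(y_true, y_pred)`.

```python
def top_confusions(cm, labels, k=3):
    """Return the k largest off-diagonal cells as (count, actual, predicted)."""
    pairs = []
    for i, row in enumerate(cm):
        for j, count in enumerate(row):
            if i != j:  # skip correct predictions on the diagonal
                pairs.append((count, labels[i], labels[j]))
    # sort by count only, largest first
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs[:k]


# Hypothetical toy matrix (rows = actual, columns = predicted)
toy_labels = ["T_shirt/top", "Shirt", "Trouser"]
toy_cm = [
    [860, 130, 10],
    [110, 870, 6],
    [3, 5, 988],
]

for count, actual, predicted in top_confusions(toy_cm, toy_labels, k=2):
    print(f"{actual} predicted as {predicted}: {count}")
```

Ranking the off-diagonal cells this way makes the most frequent mix-ups explicit instead of relying on reading the heatmap by eye.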
I will select OwnBaseline2 for model improvement since it achieved the highest validation accuracy.
display(hist_df.iloc[3:5])
classification_rep(ResNet4Blocks,criterion,valloader,train_loss4, train_acc4, val_loss4, val_acc4, epoch_his4,'ResNet4Blocks')
classification_rep(ResNet3Blocks,criterion,valloader,train_loss5, train_acc5, val_loss5, val_acc5, epoch_his5,'ResNet3Blocks')
| | Lowest Val Loss | Highest Val Acc | Lowest Train Loss | Highest Train Acc | Epoch (Highest Val Acc) | Model Description | Parameter |
|---|---|---|---|---|---|---|---|
| 0 | 0.249656 | 91.33 | 0.061010 | 97.724 | 22 | ResNet4Blocks | 8,028,618 |
| 0 | 0.253545 | 91.69 | 0.007266 | 99.740 | 50 | ResNet3Blocks | 2,258,890 |
ResNet4Blocks
precision recall f1-score support
T_shirt/top 0.85 0.85 0.85 1001
Trouser 1.00 0.98 0.99 996
Pullover 0.88 0.84 0.86 1004
Dress 0.88 0.94 0.91 1017
Coat 0.85 0.82 0.84 997
Sandal 0.98 0.98 0.98 971
Shirt 0.74 0.77 0.75 986
Sneaker 0.96 0.96 0.96 1003
Bag 0.99 0.98 0.98 1017
Ankle boot 0.97 0.97 0.97 1008
accuracy 0.91 10000
macro avg 0.91 0.91 0.91 10000
weighted avg 0.91 0.91 0.91 10000
ResNet3Blocks
precision recall f1-score support
T_shirt/top 0.89 0.80 0.84 1001
Trouser 0.98 0.99 0.99 996
Pullover 0.88 0.82 0.85 1004
Dress 0.94 0.89 0.91 1017
Coat 0.82 0.85 0.83 997
Sandal 0.98 0.97 0.97 971
Shirt 0.67 0.80 0.73 986
Sneaker 0.95 0.97 0.96 1003
Bag 0.98 0.97 0.98 1017
Ankle boot 0.98 0.96 0.97 1008
accuracy 0.90 10000
macro avg 0.91 0.90 0.90 10000
weighted avg 0.91 0.90 0.90 10000
ResNet is a more robust architecture than my previous baseline model, which was inspired by VGGNet, but here its complexity has done more harm than good: training accuracy is very high (about 97.7%-99.7%) while validation accuracy is noticeably lower (about 91%). With this much higher variance, we can see that more shirts are misidentified as T_shirts compared to the simpler models above.
display(hist_df.iloc[5:])
classification_rep(DenseNet4Blocks,criterion,valloader,train_loss6, train_acc6, val_loss6, val_acc6, epoch_his6,'DenseNet4Blocks')
classification_rep(DenseNet3Blocks,criterion,valloader,train_loss7, train_acc7, val_loss7, val_acc7, epoch_his7,'DenseNet3Blocks')
| | Lowest Val Loss | Highest Val Acc | Lowest Train Loss | Highest Train Acc | Epoch (Highest Val Acc) | Model Description | Parameter |
|---|---|---|---|---|---|---|---|
| 0 | 0.305238 | 90.11 | 0.014231 | 99.514 | 36 | DenseNet4Blocks | 1,253,276 |
| 0 | 0.295854 | 89.87 | 0.039509 | 98.612 | 10 | DenseNet3Blocks | 1,122,830 |
DenseNet4Blocks
precision recall f1-score support
T_shirt/top 0.84 0.84 0.84 1001
Trouser 0.99 0.98 0.99 996
Pullover 0.89 0.76 0.82 1004
Dress 0.88 0.92 0.90 1017
Coat 0.75 0.89 0.81 997
Sandal 0.95 0.98 0.96 971
Shirt 0.74 0.70 0.72 986
Sneaker 0.96 0.93 0.94 1003
Bag 0.98 0.98 0.98 1017
Ankle boot 0.97 0.97 0.97 1008
accuracy 0.89 10000
macro avg 0.90 0.89 0.89 10000
weighted avg 0.90 0.89 0.89 10000
DenseNet3Blocks
precision recall f1-score support
T_shirt/top 0.83 0.85 0.84 1001
Trouser 0.99 0.99 0.99 996
Pullover 0.78 0.86 0.82 1004
Dress 0.91 0.90 0.90 1017
Coat 0.79 0.82 0.81 997
Sandal 0.99 0.96 0.98 971
Shirt 0.77 0.64 0.70 986
Sneaker 0.96 0.92 0.94 1003
Bag 0.97 0.97 0.97 1017
Ankle boot 0.92 0.99 0.95 1008
accuracy 0.89 10000
macro avg 0.89 0.89 0.89 10000
weighted avg 0.89 0.89 0.89 10000
A similar result is seen here with DenseNet. The large gap between the training and validation accuracy and loss indicates severe overfitting. Many shirts and T_shirts are also misidentified.
The problem with robust architecture and technology: Even though I tried to lower the complexity of ResNet and DenseNet, they still overfit even with a high weight decay. While both use interesting and impressive architectures and techniques, these did much more harm than good here. After all, these architectures were built to classify much higher resolution images (224x224) with more channels, across 1000 different classes and about 1.2 million images. ResNet and DenseNet were never built with the intent of classifying only 10 classes of single-channel, low-resolution images from a relatively small dataset; they were meant to overcome the vanishing gradient problem and to increase robustness and complexity. However, vanishing gradients and the need for more model complexity were never a problem in the first place for FashionMNIST classification. Therefore, I have learnt that sometimes the best model is actually quite simple and not so fancy, especially for a problem like this.
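To put a number on the overfitting discussed above, the gap between training and validation accuracy can be computed per model. This is a rough sketch under one assumption: the highest training accuracy and highest validation accuracy in the tables may come from different epochs, so the gap is only a crude indicator. The helper `accuracy_gap` is introduced here for illustration.

```python
# (highest train acc, highest val acc) in percent, copied from the result tables above
results = {
    "OwnBaseline1": (98.004, 94.32),
    "OwnBaseline2": (98.968, 94.68),
    "OwnBaseline3": (98.206, 92.57),
    "ResNet4Blocks": (97.724, 91.33),
    "ResNet3Blocks": (99.740, 91.69),
    "DenseNet4Blocks": (99.514, 90.11),
    "DenseNet3Blocks": (98.612, 89.87),
}


def accuracy_gap(train_acc, val_acc):
    """Gap in percentage points between train and validation accuracy."""
    return round(train_acc - val_acc, 2)


gaps = {name: accuracy_gap(tr, va) for name, (tr, va) in results.items()}
# Print models from smallest to largest gap
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:<16} gap = {gap:.2f} points")
```

On these numbers the three baselines have the smallest gaps and the ResNet and DenseNet variants the largest, which matches the qualitative reading of the learning curves.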
del train_loss1, train_acc1, val_loss1, val_acc1, epoch_his1, train_loss2, train_acc2, val_loss2, val_acc2, epoch_his2, train_loss3, train_acc3, val_loss3, val_acc3, epoch_his3, train_loss4, train_acc4, val_loss4, val_acc4, epoch_his4, train_loss5, train_acc5, val_loss5, val_acc5, epoch_his5, train_loss6, train_acc6, val_loss6, val_acc6, epoch_his6, train_loss7, train_acc7, val_loss7, val_acc7, epoch_his7, DenseNet4Blocks, DenseNet3Blocks, ResNet4Blocks, ResNet3Blocks, OwnBaseline1,OwnBaseline2,OwnBaseline3 #, train_loss8, train_acc8, val_loss8, val_acc8, epoch_his8
Models we are tuning:
Augmentation:
Parameters I am tuning:
SGD with momentum will be used for hyperparameter tuning: though much less efficient than Adam, SGD tends to generalize slightly better, so I will tune with SGD instead of constantly switching optimizers. Furthermore, the best hyperparameter values for Adam and SGD may be so different that I would have to search a very wide range of learning rates and weight decays, which may take a very long time.
| Split | Size | Description |
|---|---|---|
| Trainset 1 | 50K | Normalization (min-max) |
| Trainset 2 (Replicated Data) | 50k + 50k | Normalization (min-max), RandomHorizontalFlip(0.8), RandomErasing() |
| Trainset 3 (Replicated Data) | 50k + 50k | Normalization (min-max), RandomHorizontalFlip(0.8), RandomRotation(-5,5) |
| Trainset 4 (Replicated Data) | 50k + 50k | Normalization (min-max), RandomHorizontalFlip(0.8), RandomErasing(), RandomRotation(5) |
| Trainset 5 (Replicated Data) | 50k + 50k | Normalization (min-max), RandomErasing(), RandomRotation(-5,5) |
| Trainset 6 (Replicated Data) | 50K + 50k | Normalization (min-max), RandomHorizontalFlip(0.8), RandomPerspective(0.115,0.4), RandomErasing() |
Reinitialize OwnBaseline1 & OwnBaseline2
trainloader2 = DataLoader(TensorDataset(trainset2.type('torch.FloatTensor'),traindata_label_100k.type('torch.FloatTensor')), shuffle=True, batch_size=BATCH_SIZE)
trainloader3 = DataLoader(TensorDataset(trainset3.type('torch.FloatTensor'),traindata_label_100k.type('torch.FloatTensor')), shuffle=True, batch_size=BATCH_SIZE)
trainloader4 = DataLoader(TensorDataset(trainset4.type('torch.FloatTensor'),traindata_label_100k.type('torch.FloatTensor')), shuffle=True, batch_size=BATCH_SIZE)
trainloader5 = DataLoader(TensorDataset(trainset5.type('torch.FloatTensor'),traindata_label_100k.type('torch.FloatTensor')), shuffle=True, batch_size=BATCH_SIZE)
trainloader6 = DataLoader(TensorDataset(trainset6.type('torch.FloatTensor'),traindata_label_100k.type('torch.FloatTensor')), shuffle=True, batch_size=BATCH_SIZE)
New dataframe to log our results
df_augmentation = pd.DataFrame([],columns=['Lowest Val Loss','Highest Val Acc','Lowest Train Loss','Epoch (Highest Val Acc)', 'Model Description', 'Parameter'])
Training the different types of augmentation sets
for i in range(2):
    OwnBaseline2_1 = SimpleNet2([2,2,2],1,32,2,512)
    OwnBaseline2_2 = SimpleNet2([2,2,2],1,32,2,512)
    OwnBaseline2_3 = SimpleNet2([2,2,2],1,32,2,512)
    OwnBaseline2_4 = SimpleNet2([2,2,2],1,32,2,512)
    OwnBaseline2_5 = SimpleNet2([2,2,2],1,32,2,512)
    OwnBaseline2_6 = SimpleNet2([2,2,2],1,32,2,512)
    train_loss1, train_acc1, val_loss1, val_acc1, epoch_his1, df_augmentation = _train(OwnBaseline2_1,'OwnBaseline2-No Aug','2,917,482',False,df=df_augmentation,showEpochs=False,trainloader=trainloader1)
    train_loss2, train_acc2, val_loss2, val_acc2, epoch_his2, df_augmentation = _train(OwnBaseline2_2,'OwnBaseline2-HorizontalFlip(0.8),RandomErasing','2,917,482',False,df=df_augmentation,showEpochs=False,trainloader=trainloader2)
    train_loss3, train_acc3, val_loss3, val_acc3, epoch_his3, df_augmentation = _train(OwnBaseline2_3,'OwnBaseline2-RandomHorizontalFlip(0.8),RandomRotation(-5,5)','2,917,482',False,df=df_augmentation,showEpochs=False,trainloader=trainloader3)
    train_loss4, train_acc4, val_loss4, val_acc4, epoch_his4, df_augmentation = _train(OwnBaseline2_4,'OwnBaseline2-RandomHorizontalFlip(0.8),RandomErasing(),RandomRotation(-5,5)','2,917,482',False,df=df_augmentation,showEpochs = False,trainloader=trainloader4)
    train_loss5, train_acc5, val_loss5, val_acc5, epoch_his5, df_augmentation = _train(OwnBaseline2_5,'OwnBaseline2-RandomErasing(), RandomRotation(-5,5)','2,917,482',False,df=df_augmentation,showEpochs = False,trainloader=trainloader5)
    train_loss6, train_acc6, val_loss6, val_acc6, epoch_his6, df_augmentation = _train(OwnBaseline2_6,'OwnBaseline2-RandomHorizontalFlip(0.8),RandomErasing(),RandomPerspective(0.125,0.375)','2,917,482',False,df=df_augmentation,showEpochs = False,trainloader=trainloader6)
    clear_output(wait=True)
df_augmentation
| | Lowest Val Loss | Highest Val Acc | Lowest Train Loss | Epoch (Highest Val Acc) | Model Description | Parameter | Highest Train Acc |
|---|---|---|---|---|---|---|---|
| 0 | 0.184186 | 94.51 | 0.022331 | 47 | OwnBaseline2-No Aug | 2,917,482 | 99.204 |
| 0 | 0.167996 | 94.87 | 0.031395 | 36 | OwnBaseline2-HorizontalFlip(0.8),RandomErasing | 2,917,482 | 98.867 |
| 0 | 0.168978 | 95.16 | 0.020353 | 69 | OwnBaseline2-RandomHorizontalFlip(0.8),RandomR... | 2,917,482 | 99.281 |
| 0 | 0.163787 | 95.13 | 0.025544 | 58 | OwnBaseline2-RandomHorizontalFlip(0.8),RandomE... | 2,917,482 | 99.093 |
| 0 | 0.170332 | 94.65 | 0.033518 | 20 | OwnBaseline2-RandomErasing(), RandomRotation(-... | 2,917,482 | 98.800 |
| 0 | 0.172038 | 94.80 | 0.030348 | 42 | OwnBaseline2-RandomHorizontalFlip(0.8),RandomE... | 2,917,482 | 98.925 |
| 0 | 0.180670 | 94.71 | 0.017113 | 76 | OwnBaseline2-No Aug | 2,917,482 | 99.388 |
| 0 | 0.169142 | 95.04 | 0.026290 | 48 | OwnBaseline2-HorizontalFlip(0.8),RandomErasing | 2,917,482 | 99.073 |
| 0 | 0.162305 | 95.17 | 0.023202 | 51 | OwnBaseline2-RandomHorizontalFlip(0.8),RandomR... | 2,917,482 | 99.157 |
| 0 | 0.166328 | 95.21 | 0.024472 | 59 | OwnBaseline2-RandomHorizontalFlip(0.8),RandomE... | 2,917,482 | 99.126 |
| 0 | 0.173721 | 94.72 | 0.024363 | 35 | OwnBaseline2-RandomErasing(), RandomRotation(-... | 2,917,482 | 99.130 |
| 0 | 0.167446 | 94.85 | 0.047426 | 19 | OwnBaseline2-RandomHorizontalFlip(0.8),RandomE... | 2,917,482 | 98.248 |
In both runs, every augmentation set achieves a lower validation loss and a higher validation accuracy than no augmentation, so we can be fairly confident that augmentation helps. To choose the data augmentation set, I will compare the no-augmentation set with the top 2 augmentation sets.
Evaluating 3 sets:
| Split | Size | Description | Average Val Acc |
|---|---|---|---|
| Trainset 1 | 50K | Normalization (min-max) | 94.61% |
| Trainset 3 (Replicated Data) | 50k + 50k | Normalization (min-max), RandomHorizontalFlip(0.8), RandomRotation(-5,5) | 95.165% |
| Trainset 4 (Replicated Data) | 50k + 50k | Normalization (min-max), RandomHorizontalFlip(0.8), RandomErasing(), RandomRotation(5) | 95.17% |
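The averages in the table above can be reproduced from the two runs logged in `df_augmentation`. The sketch below hard-codes the highest validation accuracies from that table (assuming the rows of each run follow the order trainloader1 to trainloader6) and ranks the sets by their mean.

```python
from statistics import mean

# Highest validation accuracy (%) of each run, copied from df_augmentation above
val_acc_runs = {
    "Trainset 1": [94.51, 94.71],
    "Trainset 2": [94.87, 95.04],
    "Trainset 3": [95.16, 95.17],
    "Trainset 4": [95.13, 95.21],
    "Trainset 5": [94.65, 94.72],
    "Trainset 6": [94.80, 94.85],
}

averages = {name: mean(runs) for name, runs in val_acc_runs.items()}
# Best average first
ranked = sorted(averages, key=averages.get, reverse=True)
for name in ranked:
    print(f"{name}: {averages[name]:.3f}%")
```

With only two runs per set, differences of a few hundredths of a percent are well within run-to-run noise, which is why a second criterion is used to choose between sets 3 and 4.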
display(df_augmentation.iloc[-6])
classification_rep(OwnBaseline2_1,criterion,valloader,train_loss1, train_acc1, val_loss1, val_acc1, epoch_his1,'No Augmentation')
display(df_augmentation.iloc[-3])
classification_rep(OwnBaseline2_2,criterion,valloader,train_loss2, train_acc2, val_loss2, val_acc2, epoch_his2,'RandomHorizontalFlip(0.8),RandomRotation(-5,5)')
display(df_augmentation.iloc[-4])
classification_rep(OwnBaseline2_6,criterion,valloader,train_loss6, train_acc6, val_loss6, val_acc6, epoch_his6,'RandomHorizontalFlip(0.8),RandomPerspective(),RandomErasing')
Lowest Val Loss 0.18067 Highest Val Acc 94.71 Lowest Train Loss 0.017113 Epoch (Highest Val Acc) 76 Model Description OwnBaseline2-No Aug Parameter 2,917,482 Highest Train Acc 99.388 Name: 0, dtype: object
No Augmentation
precision recall f1-score support
T_shirt/top 0.92 0.89 0.90 1001
Trouser 1.00 0.99 0.99 996
Pullover 0.93 0.92 0.92 1004
Dress 0.94 0.94 0.94 1017
Coat 0.89 0.92 0.91 997
Sandal 1.00 0.99 0.99 971
Shirt 0.83 0.84 0.84 986
Sneaker 0.97 0.98 0.98 1003
Bag 0.99 0.99 0.99 1017
Ankle boot 0.98 0.98 0.98 1008
accuracy 0.94 10000
macro avg 0.94 0.94 0.94 10000
weighted avg 0.94 0.94 0.94 10000
Lowest Val Loss 0.166328 Highest Val Acc 95.21 Lowest Train Loss 0.024472 Epoch (Highest Val Acc) 59 Model Description OwnBaseline2-RandomHorizontalFlip(0.8),RandomE... Parameter 2,917,482 Highest Train Acc 99.126 Name: 0, dtype: object
RandomHorizontalFlip(0.8),RandomRotation(-5,5)
precision recall f1-score support
T_shirt/top 0.90 0.91 0.91 1001
Trouser 0.99 0.99 0.99 996
Pullover 0.90 0.94 0.92 1004
Dress 0.94 0.95 0.95 1017
Coat 0.93 0.90 0.92 997
Sandal 0.99 0.99 0.99 971
Shirt 0.87 0.84 0.86 986
Sneaker 0.97 0.99 0.98 1003
Bag 0.99 0.99 0.99 1017
Ankle boot 0.99 0.97 0.98 1008
accuracy 0.95 10000
macro avg 0.95 0.95 0.95 10000
weighted avg 0.95 0.95 0.95 10000
Lowest Val Loss 0.162305 Highest Val Acc 95.17 Lowest Train Loss 0.023202 Epoch (Highest Val Acc) 51 Model Description OwnBaseline2-RandomHorizontalFlip(0.8),RandomR... Parameter 2,917,482 Highest Train Acc 99.157 Name: 0, dtype: object
RandomHorizontalFlip(0.8),RandomPerspective(),RandomErasing
precision recall f1-score support
T_shirt/top 0.95 0.86 0.90 1001
Trouser 1.00 0.99 0.99 996
Pullover 0.92 0.94 0.93 1004
Dress 0.95 0.95 0.95 1017
Coat 0.92 0.91 0.92 997
Sandal 1.00 0.99 0.99 971
Shirt 0.81 0.89 0.85 986
Sneaker 0.98 0.98 0.98 1003
Bag 0.99 0.99 0.99 1017
Ankle boot 0.98 0.98 0.98 1008
accuracy 0.95 10000
macro avg 0.95 0.95 0.95 10000
weighted avg 0.95 0.95 0.95 10000
Since augmentation sets 3 and 4 both perform well, differ by less than 0.1% in validation accuracy, and have roughly similar learning curves, I will pick the augmentation set with the fewest wrongly predicted Shirt/T_shirt samples, as our previous analysis showed that Shirt and T_shirt are the classes most often misidentified.
trainloader4 (100K samples - HorizontalFlip(0.8),RandomRotation(-5,5),RandomErasing()) will be used during hyperparameter tuning.
del trainloader2, trainloader3, trainloader5, trainloader6, trainset1, trainset2, trainset3, trainset5, trainset6, traindata_label, OwnBaseline2_1,OwnBaseline2_2,OwnBaseline2_3,OwnBaseline2_4,OwnBaseline2_5,OwnBaseline2_6
There are tons of hyperparameters to tune. However, I have selected the 5 that I feel matter the most, backed by a little research.
Hyperparameters I am tuning:
With 5 hyperparameters already being tuned, it would be too time-consuming and computationally demanding to tune every hyperparameter I could think of. However, if computational power and time were not an issue, I would definitely tune these as well:
Hyperparameter Technique:
For hyperparameter tuning I will build a custom random hyperparameter tuner. The technique used to find the best hyperparameters for my baseline model is coarse-to-fine search (two rounds of random search), as suggested by Andrew Ng. The first random search covers a fairly wide range of values. The second random search then cuts down the range and searches only the subset where the best hyperparameters most likely lie, based on where the trials with the highest validation accuracy / lowest validation loss were found. In the image example below, there are 2 hand-drawn boxes by Andrew Ng: the outer box corresponds to the first random search, and the inner box to the second random search.
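The coarse-to-fine idea can be sketched on a toy problem. Everything here is an assumption made for illustration: `toy_score` stands in for a real training run, values are sampled continuously on a log scale (the notebook's tuner instead shuffles a discrete grid), and the range is narrowed automatically from the top 5 trials, whereas in this notebook the narrowing is done by hand after inspecting the results.

```python
import math
import random


def toy_score(lr, wd):
    """Hypothetical stand-in for validation accuracy; peaks near lr=0.01, wd=0.001."""
    return -((math.log10(lr) + 2) ** 2) - (math.log10(wd) + 3) ** 2


def random_search(lr_range, wd_range, trials, rng):
    """Sample log-uniformly inside the given ranges and score every trial."""
    results = []
    for _ in range(trials):
        lr = 10 ** rng.uniform(math.log10(lr_range[0]), math.log10(lr_range[1]))
        wd = 10 ** rng.uniform(math.log10(wd_range[0]), math.log10(wd_range[1]))
        results.append((toy_score(lr, wd), lr, wd))
    # best score first
    return sorted(results, key=lambda r: r[0], reverse=True)


def shrink_range(top_trials, index):
    """New search range = span covered by the best trials for one hyperparameter."""
    values = [trial[index] for trial in top_trials]
    return min(values), max(values)


rng = random.Random(0)
# Round 1: wide ranges
coarse = random_search((1e-4, 1e-1), (1e-5, 1e-1), trials=50, rng=rng)
top = coarse[:5]
# Round 2: only the region where the best coarse trials were found
fine_lr, fine_wd = shrink_range(top, 1), shrink_range(top, 2)
fine = random_search(fine_lr, fine_wd, trials=50, rng=rng)
print("coarse best:", coarse[0])
print("fine best:  ", fine[0])
```

The second round spends its whole budget inside the promising region, which is the point of the inner box in Andrew Ng's drawing.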

Importing the SAM optimizer for pytorch [Foret et al.,2021]
!git clone https://github.com/davda54/sam.git
Cloning into 'sam'... remote: Enumerating objects: 200, done. remote: Counting objects: 100% (96/96), done. remote: Compressing objects: 100% (36/36), done. remote: Total 200 (delta 73), reused 66 (delta 60), pack-reused 104 Receiving objects: 100% (200/200), 659.65 KiB | 21.28 MiB/s, done. Resolving deltas: 100% (95/95), done.
from sam.sam import SAM
Prepping trainloader - removing irrelevant trainsets from memory
BATCH_SIZE=512
trainloader4 = DataLoader(TensorDataset(trainset4.type('torch.FloatTensor'),traindata_label_100k.type('torch.FloatTensor')), shuffle=True, batch_size=BATCH_SIZE)
valloader = DataLoader(TensorDataset(valset.type('torch.FloatTensor'),valdata_label.type('torch.FloatTensor')), shuffle=False, batch_size=BATCH_SIZE)
del trainset1, trainset2, trainset3, trainset5, trainset6
$$lr_{\text{epoch}} = lr_{\text{initial}} \times \text{Lambda}(\text{epoch})$$
See the graph for more information - 100 epochs example (learning rate decay over epochs)
epoch=100
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
lambda1 = lambda epoch: 0.97 ** epoch
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda1)
lrs = []
for i in range(epoch):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.xlabel('epochs')
plt.ylabel('learning rate')
plt.plot(range(epoch),lrs)
plt.show()
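The same decay curve can be written out without PyTorch, which makes the formula explicit. This is a small sketch; the helper name `lambda_lr_schedule` is made up here, while the factor 0.97 and the initial learning rate 0.1 are the ones used in the plot above.

```python
def lambda_lr_schedule(lr_initial, lr_lambda, epochs):
    """Learning rate used at each epoch: lr_initial * lr_lambda(epoch)."""
    return [lr_initial * lr_lambda(epoch) for epoch in range(epochs)]


lrs = lambda_lr_schedule(0.1, lambda epoch: 0.97 ** epoch, 100)
print(f"epoch 0:  {lrs[0]:.5f}")
print(f"epoch 50: {lrs[50]:.5f}")
print(f"epoch 99: {lrs[99]:.5f}")
```

Because the multiplier is applied to the initial learning rate rather than to the previous epoch's value, the schedule is a plain exponential decay of 3% per epoch.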
Since we will be running 100 epochs, I will set a slightly lower early-stopping patience
# init early stopper with patience of 15
earlyStopper = EarlyStopping(15)
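For reference, a patience-based stopper of this kind can be sketched as below. The notebook's own `EarlyStopping` class is defined earlier and may behave differently; `PatienceStopper` is a hypothetical stand-in that only mirrors the interface used here (called with validation accuracy and loss, plus `reset()` and `patience`) and assumes an epoch counts as an improvement if either metric improves.

```python
class PatienceStopper:
    """Hypothetical stand-in for the notebook's EarlyStopping class."""

    def __init__(self, patience):
        self.patience = patience
        self.reset()

    def reset(self):
        self.best_acc = float("-inf")
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def __call__(self, val_acc, val_loss):
        improved = False
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            improved = True
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            improved = True
        # count consecutive epochs without any improvement
        self.bad_epochs = 0 if improved else self.bad_epochs + 1
        return self.bad_epochs >= self.patience


stopper = PatienceStopper(patience=3)
# Made-up (val accuracy, val loss) history
history = [(90.0, 0.30), (91.0, 0.28), (90.5, 0.29), (90.8, 0.31), (90.9, 0.30), (91.5, 0.27)]
for epoch, (acc, loss) in enumerate(history, start=1):
    if stopper(acc, loss):
        print(f"stopped at epoch {epoch}")
        break
```

A lower patience ends unpromising trials sooner, which matters when many trials are run back to back.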
The SAM optimizer uses a double forward-backward pass, so I have to edit the train function for it. For resources and documentation, see the original researcher's GitHub.
def train_sam(model, loader, optimizer, NUM_EPOCHS, criterion, valloader=None,earlystopper=None,showEpoch=True,scheduler=None):
    val_loss_his = []
    val_accuracy_his = []
    epoch_his = []
    earlystopper.reset()
    for epoch in range(NUM_EPOCHS):
        t0 = time.time()
        for data in loader:
            img, label = data
            img = img.to(device)
            label = label.to(device)
            # first forward-backward pass
            outputs = model(img)
            loss = criterion(outputs, label) # use this loss for any training statistics
            loss.backward()
            optimizer.first_step(zero_grad=True)
            # second forward-backward pass
            outputs = model(img)
            criterion(outputs, label).backward() # make sure to do a full forward pass
            optimizer.second_step(zero_grad=True)
        # decay the learning rate once per epoch
        if scheduler is not None:
            scheduler.step()
        val_loss, val_accuracy = validate(model, criterion, valloader)
        val_loss_his.append(val_loss)
        val_accuracy_his.append(val_accuracy)
        epoch_his.append(epoch+1)
        if showEpoch:
            print(f'- [Epoch {epoch+1}/{NUM_EPOCHS}] |_|_| Val Loss: {val_loss:.3f} | Val Accuracy: {val_accuracy} | Est: {time.time() - t0:.2f}s')
        # use the stopper passed in as an argument rather than the global one
        if earlystopper(val_accuracy,val_loss):
            print(f'EarlyStopper triggered at epochs: {epoch+1} \n*No improvement to validation loss and accuracy could be seen for the past {earlystopper.patience} epochs')
            break
    print(f'Highest Val Accuracy: {max(val_accuracy_his)} @ epoch {epoch_his[val_accuracy_his.index(max(val_accuracy_his))]} | Lowest Val Loss: {min(val_loss_his)} @ epoch {epoch_his[val_loss_his.index(min(val_loss_his))]}')
    # training loss/accuracy are not tracked here; None placeholders keep the same 5-value return as train()
    return None, None, val_loss_his, val_accuracy_his, epoch_his
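To see why two forward-backward passes are needed, here is a minimal sketch of one SAM update on a toy quadratic loss. It is not the `sam` package's implementation: the functions `loss`, `grad` and `sam_step` are made up for illustration, and the base update is plain gradient descent rather than the SGD with momentum used in this notebook.

```python
import math


def loss(w):
    """Toy quadratic loss, steeper along the second coordinate."""
    return w[0] ** 2 + 10 * w[1] ** 2


def grad(w):
    """Analytic gradient of the toy loss."""
    return [2 * w[0], 20 * w[1]]


def sam_step(w, lr=0.02, rho=0.05):
    # First pass: gradient at the current weights
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    # Move a distance rho along the gradient to a nearby higher-loss point
    e = [rho * gi / norm for gi in g]
    w_adv = [wi + ei for wi, ei in zip(w, e)]
    # Second pass: gradient at the perturbed weights
    g_adv = grad(w_adv)
    # Update the original weights using the second gradient
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


w = [1.0, 1.0]
start = loss(w)
for _ in range(50):
    w = sam_step(w)
print(f"loss before: {start:.4f}, after 50 SAM steps: {loss(w):.6f}")
```

In `train_sam`, `first_step` plays the role of the perturbation and `second_step` the role of the final update, which is why the model has to be evaluated twice per batch.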
These functions cache my hyperparameter tuning state to a text file and load it back. Since I usually run hyperparameter tuning on different systems, a feature like this is handy: I can easily stop tuning halfway and continue at a later time. It also serves as a backup of the trialed hyperparameters if my system ever crashes.
def loadTunerState(stateData = 'hyperparmState.csv'):
    with open(os.path.join(sys.path[0], stateData),'r') as file:
        highest_val_acc = []
        val_acc_hist = []
        trial_hist = []
        total_time = 0
        rawTxt = file.read()
        splited = rawTxt.split('\n')
        # line 1: best accuracy, index of the best trial, epoch of the best accuracy
        for val in splited[0].split(','):
            highest_val_acc.extend([float(val)])
        # line 2: highest validation accuracy of every trial
        for val in splited[1].split(','):
            val_acc_hist.extend([float(val)])
        # line 3: flattened trials, 4 values per trial (LR, weight decay, momentum, best epoch)
        c = 0
        trial = []
        for val in splited[2].split(','):
            c += 1
            trial.extend([float(val)])
            if c%4==0:
                trial_hist.append(trial)
                trial = []
        # line 4: tuner start time, line 5: time at which the state was saved
        total_time = float(splited[3])
        saved_time = float(splited[4])
    return highest_val_acc, val_acc_hist, trial_hist, total_time, saved_time
def saveTunerState(highest_val_acc,val_acc_hist,trial_hist,stoppedTime,stateData = 'hyperparmState.csv'):
    inp = ''
    with open(os.path.join(sys.path[0], stateData),'w') as file:
        for val in highest_val_acc:
            inp += f'{val},'
        file.write(inp[:-1]+'\n')
        inp = ''
        for val in val_acc_hist:
            inp += f'{val},'
        file.write(inp[:-1]+'\n')
        inp = ''
        for val in trial_hist:
            for val2 in val:
                inp += f'{val2},'
        file.write(inp[:-1]+'\n'+str(stoppedTime)+'\n'+str(time.time()))
    return
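The save / resume idea can also be illustrated with a self-contained round trip. Note that this sketch swaps in JSON instead of the comma-separated format used above, purely to keep the example short; `save_state`, `load_state`, `already_run` and the file name are hypothetical, and the numbers are illustrative values.

```python
import json
import os
import tempfile
import time


def save_state(path, best, val_acc_hist, trial_hist, started_at):
    """Write the tuner state as JSON (an alternative to the CSV-style file above)."""
    state = {
        "best": best,
        "val_acc_hist": val_acc_hist,
        "trial_hist": trial_hist,
        "started_at": started_at,
        "saved_at": time.time(),
    }
    with open(path, "w") as f:
        json.dump(state, f)


def load_state(path):
    with open(path) as f:
        return json.load(f)


def already_run(trial, trial_hist, n_params=3):
    """True if the first n_params values of a past trial match this trial."""
    return any(past[:n_params] == trial for past in trial_hist)


path = os.path.join(tempfile.gettempdir(), "tuner_state_demo.json")
# Each trial: [LR, weight decay, momentum, epoch of best val acc] (illustrative values)
trial_hist = [[0.014987, 0.000619, 0.96, 65], [0.04, 0.008, 0.951, 15]]
save_state(path, [95.42, 0, 65], [95.42, 91.64], trial_hist, started_at=0.0)

state = load_state(path)
print(already_run([0.04, 0.008, 0.951], state["trial_hist"]))
print(already_run([0.02, 0.001, 0.933], state["trial_hist"]))
```

Checking each candidate against the loaded history is what lets a resumed search skip combinations that were already trained.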
Function for my first random search
def NetRandomTuner(LR_range = np.logspace(1,2,num=10)/1000 * 0.4,WD_range=np.logspace(1,5,num=10)/1000000,Beta_range=np.linspace(0.9,0.96, num=6),layer_range=np.arange(2,4), optimizer_choice=np.array(['SGD','SAM']),trials=50,epoch=80):
    possible_trials=[]
    highest_val_acc = [0,0,0]
    val_acc_hist = []
    trial_hist = []
    for LR, WD, B3, M, O in itertools.product(*(LR_range,WD_range,Beta_range,layer_range, optimizer_choice)):
        possible_trials.append([LR, WD, B3, M, O])
    # shuffle all possible trials
    random.shuffle(possible_trials)
    trial_count = 0
    t0 = time.time()
    for trial in possible_trials:
        print(trial)
        t1 = time.time()
        if trial_count == trials:
            print(f'\n\nTrial ended at trial #{trial_count}')
            break
        trial_count += 1
        # A lot more options to tune if you want. Refer to section 4.2 discussion (growthrate,filters,neurons)
        model = SimpleNet2([2,2,trial[3]],1,32,2,512)
        # use all available GPUs
        if torch.cuda.device_count() > 1:
            model = nn.DataParallel(model)
        model = model.to(device)
        if trial[4] == 'SGD':
            opt = optim.SGD(model.parameters(),lr=trial[0], weight_decay=trial[1], momentum=trial[2])
            scheduler = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lambda1)
            train_loss, train_acc, val_loss, val_acc, epoch_his = train(model, trainloader4,opt,epoch,criterion,valloader,earlyStopper,True,scheduler)
        else:
            opt_base = optim.SGD
            opt = SAM(model.parameters(), opt_base, lr=trial[0], momentum=trial[2], weight_decay=trial[1])
            scheduler = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lambda1)
            train_loss, train_acc, val_loss, val_acc, epoch_his = train_sam(model, trainloader4,opt,epoch,criterion,valloader,earlyStopper,True,scheduler)
        # keep track of the best trial so far: [accuracy, trial index, epoch]
        if max(val_acc) > highest_val_acc[0]:
            highest_val_acc = [max(val_acc),trial_count-1,epoch_his[val_acc.index(max(val_acc))]]
        trial_hist.append([trial[0],trial[1],trial[2],trial[3],trial[4],epoch_his[val_acc.index(max(val_acc))]])
        val_acc_hist.extend([max(val_acc)])
        clear_output()
        print(f'''
Trial #{trial_count} Finished - Search Time {(time.time()-t1)/60:.2f} Mins
Total Time Elapsed: {(time.time()-t0)/60:.2f} Mins\n
Hyperparameters\t\t|Trial Values: #{trial_count}\t|Best Trial Values: #{highest_val_acc[1]+1}\n
Learning Rate\t\t|{trial[0]:.6f}\t\t|{trial_hist[highest_val_acc[1]][0]:.6f}
Weight Decay (L2)\t|{trial[1]:.6f}\t\t|{trial_hist[highest_val_acc[1]][1]:.6f}
Momentum\t\t|{trial[2]:.6f}\t\t|{trial_hist[highest_val_acc[1]][2]:.6f}
Last Layer No\t\t|{trial[3]}\t\t\t|{trial_hist[highest_val_acc[1]][3]}
Optimizer\t\t|{trial[4]}\t\t\t|{trial_hist[highest_val_acc[1]][4]}\n
Highest Val Acc\t\t|{max(val_acc)}\t\t\t|{highest_val_acc[0]:.2f}
Epoch (Highest Val)\t|{epoch_his[val_acc.index(max(val_acc))]}\t\t\t|{highest_val_acc[2]}\n\n''')
    return trial_hist, val_acc_hist
trial_hist1, val_acc_hist1 = NetRandomTuner(epoch=100,trials=75)
Trial #75 Finished - Search Time 10.59 Mins Total Time Elapsed: 917.76 Mins Hyperparameters |Trial Values: #75 |Best Trial Values: #67 Learning Rate |0.011130 |0.005166 Weight Decay (L2) |0.035938 |0.004642 Momentum |0.936000 |0.948000 Last Layer No |3 |2 Optimizer |SGD |SAM Highest Val Acc |94.33 |95.38 Epoch (Highest Val) |67 |64 [0.02397937001275764, 0.03593813663804626, 0.924, 2, 'SGD'] Trial ended at trial #75
Now let's analyse the hyperparameters of the trials with the top 3 highest and bottom 3 lowest validation accuracy
Function to display the top N and bottom N trials by highest validation accuracy, together with the hyperparameters recorded during the random search.
def hyperparmsOverview(val_acc_hist, trial_hist, returnNum = 3):
    cols = ['Highest val acc','LR','Weight decay','Momentum','Last block layers','Optimizer','Epoch (of Highest Val Acc)']
    df_top = pd.DataFrame([],columns=cols)
    df_btm = pd.DataFrame([],columns=cols)
    top = sorted(zip(val_acc_hist, trial_hist), reverse=True)[:returnNum]
    btm = sorted(zip(val_acc_hist, trial_hist), reverse=False)[:returnNum]
    for i,e in zip(top,btm):
        df_top = pd.concat([df_top,pd.DataFrame([[i[0],i[1][0],i[1][1],i[1][2],i[1][3],i[1][4],i[1][5]]],columns=cols)])
        df_btm = pd.concat([df_btm,pd.DataFrame([[e[0],e[1][0],e[1][1],e[1][2],e[1][3],e[1][4],e[1][5]]],columns=cols)])
    print(f'Top {returnNum} Highest Val Accuracy Hyperparameters')
    display(df_top)
    print(f'\n\nBottom {returnNum} Highest Val Accuracy Hyperparameters')
    display(df_btm)
hyperparmsOverview(val_acc_hist1, trial_hist1)
Top 3 Highest Val Accuracy Hyperparameters
| | Highest val acc | LR | Weight decay | Momentum | Last block layers | Optimizer | Epoch (of Highest Val Acc) |
|---|---|---|---|---|---|---|---|
| 0 | 95.38 | 0.005166 | 0.004642 | 0.948 | 2 | SAM | 64 |
| 0 | 95.34 | 0.040000 | 0.000215 | 0.924 | 2 | SAM | 70 |
| 0 | 95.31 | 0.023979 | 0.004642 | 0.960 | 2 | SAM | 82 |
Bottom 3 Highest Val Accuracy Hyperparameters
| | Highest val acc | LR | Weight decay | Momentum | Last block layers | Optimizer | Epoch (of Highest Val Acc) |
|---|---|---|---|---|---|---|---|
| 0 | 10.17 | 0.040000 | 0.1 | 0.960 | 2 | SAM | 4 |
| 0 | 48.32 | 0.040000 | 0.1 | 0.936 | 3 | SAM | 2 |
| 0 | 57.02 | 0.030971 | 0.1 | 0.936 | 2 | SAM | 5 |
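The top / bottom selection can also be done by ranking on the accuracy alone, which avoids comparing the hyperparameter lists themselves. This is a small sketch using the six rows shown in the tables above; `heapq` is used here as an alternative to the `sorted(zip(...))` approach in `hyperparmsOverview`.

```python
import heapq

# (highest val acc, LR, weight decay, optimizer): rows copied from the tables above
trials = [
    (95.38, 0.005166, 0.004642, "SAM"),
    (95.34, 0.040000, 0.000215, "SAM"),
    (95.31, 0.023979, 0.004642, "SAM"),
    (10.17, 0.040000, 0.1, "SAM"),
    (48.32, 0.040000, 0.1, "SAM"),
    (57.02, 0.030971, 0.1, "SAM"),
]

# Rank on the first field (validation accuracy) only
top = heapq.nlargest(2, trials, key=lambda t: t[0])
bottom = heapq.nsmallest(2, trials, key=lambda t: t[0])

print("top:", top)
print("bottom:", bottom)
print("weight decay of the worst trials:", sorted({t[2] for t in bottom}))
```

Looking at the worst trials side by side shows that they all share the largest weight decay value in the search range (0.1).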
A collection of analyses is shown below:
With this knowledge we can effectively cut our 2nd random search space to:
Weight Decay: log-scaled range from 0.00008 to 0.008, log-scaled because the effect of a change in weight decay is much more obvious at lower values than at higher values
We can completely eliminate optimizer and layer tuning from hyperparameter tuning, since the top 3 trials all use the SAM optimizer with 2 layers in the last block, which suggests that SAM is better than SGD and that 2 layers for our last block is better than 3 layers.
Nice... our total number of hyperparameter combinations has been reduced drastically, and we are now searching a much more promising range of hyperparameters.
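The size of the two search spaces can be checked by counting the grid combinations. The sketch below uses pure-Python stand-ins (`logspace` and `linspace` are helper names defined here, not NumPy calls) with the same endpoints and number of points as the default arguments of `NetRandomTuner` and `NetRandomTuner2`.

```python
import math
from itertools import product


def logspace(start, stop, num):
    """num values from start to stop, evenly spaced on a log scale."""
    step = (math.log10(stop) - math.log10(start)) / (num - 1)
    return [10 ** (math.log10(start) + i * step) for i in range(num)]


def linspace(start, stop, num):
    """num values from start to stop, evenly spaced on a linear scale."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


lr_range = logspace(0.004, 0.04, 10)

# First search: LR, weight decay, momentum, last-block layers, optimizer
first = list(product(lr_range, logspace(1e-5, 1e-1, 10), linspace(0.9, 0.96, 6),
                     [2, 3], ["SGD", "SAM"]))

# Second search: optimizer and layer count fixed, weight decay and momentum narrowed
wd_range = logspace(0.00008, 0.008, 10)
second = list(product(lr_range, wd_range, linspace(0.924, 0.96, 5)))

print(len(first), "->", len(second))
print("weight decay candidates:", [f"{wd:.6f}" for wd in wd_range])
```

On a log scale, neighbouring weight decay candidates differ by a constant factor rather than a constant amount, so the small values are sampled as densely as the large ones.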
def NetRandomTuner2(hyperparameterCacheFile = None,saveAs = 'hyperparmState.csv',cacheEvery=3,LR_range = np.logspace(1,2,num=10)/1000 * 0.4,WD_range=np.logspace(1,3,num=10)/10000 * 0.08,Beta_range=np.linspace(0.924,0.96, num=5),trials=50,epoch=80):
    possible_trials=[]
    for LR, WD, B3 in itertools.product(*(LR_range,WD_range,Beta_range)):
        possible_trials.append([LR, WD, B3])
    # shuffle all possible trials
    random.shuffle(possible_trials)
    if hyperparameterCacheFile is None:
        highest_val_acc = [0,0,0]
        val_acc_hist = []
        trial_hist = []
        loaded_trial_hist = []
        trial_count = 0
        t0 = time.time()
    else:
        highest_val_acc, val_acc_hist, trial_hist, t0, time_saved = loadTunerState(hyperparameterCacheFile)
        # shift t0 forward by the time spent paused, so the elapsed time excludes the pause
        t0 += time.time() - time_saved
        trial_count = len(val_acc_hist)
        highest_val_acc[1] = int(highest_val_acc[1]) # just making sure is correct datatype
        highest_val_acc[2] = int(highest_val_acc[2])
        loaded_trial_hist = []
        for row in trial_hist:
            loaded_trial_hist.append(row[:3])
    opt_base = optim.SGD
    for trial in possible_trials:
        # If the trial was already run in the previous history, skip it and go to the next
        if trial in loaded_trial_hist:
            continue
        print('Next trial: ',trial)
        t1 = time.time()
        if trial_count == trials:
            print(f'\n\nTrial ended at trial #{trial_count}')
            break
        trial_count += 1
        # A lot more options to tune if you want. Refer to section 4.2 discussion (growthrate,filters,neurons)
        model = SimpleNet2([2,2,2],1,32,2,512)
        # use all available GPUs
        if torch.cuda.device_count() > 1:
            model = nn.DataParallel(model)
        model = model.to(device)
        opt = SAM(model.parameters(), opt_base, lr=trial[0], momentum=trial[2], weight_decay=trial[1])
        scheduler = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lambda1)
        _, _, _, val_acc, epoch_his = train_sam(model, trainloader4,opt,epoch,criterion,valloader,earlyStopper,True,scheduler)
        # keep track of the best trial so far: [accuracy, trial index, epoch]
        if max(val_acc) > highest_val_acc[0]:
            highest_val_acc = [max(val_acc),trial_count-1,epoch_his[val_acc.index(max(val_acc))]]
        trial_hist.append([trial[0],trial[1],trial[2],epoch_his[val_acc.index(max(val_acc))]])
        val_acc_hist.extend([max(val_acc)])
        clear_output()
        print(f'''
Trial #{trial_count} Finished - Search Time {(time.time()-t1)/60:.2f} Mins
Total Time Elapsed: {(time.time()-t0)/60:.2f} Mins\n
Hyperparameters\t\t|Trial Values: #{trial_count}\t|Best Trial Values: #{highest_val_acc[1]+1}\n
Learning Rate\t\t|{trial[0]:.6f}\t\t|{trial_hist[highest_val_acc[1]][0]:.6f}
Weight Decay (L2)\t|{trial[1]:.6f}\t\t|{trial_hist[highest_val_acc[1]][1]:.6f}
Momentum\t\t|{trial[2]:.6f}\t\t|{trial_hist[highest_val_acc[1]][2]:.6f}
Highest Val Acc\t\t|{max(val_acc)}\t\t\t|{highest_val_acc[0]:.2f}
Epoch (Highest Val)\t|{epoch_his[val_acc.index(max(val_acc))]}\t\t\t|{highest_val_acc[2]}\n\n
''')
        # periodically cache the tuner state so the search can be resumed later
        if trial_count % cacheEvery == 0:
            saveTunerState(highest_val_acc,val_acc_hist,trial_hist,t0,stateData = saveAs)
    return trial_hist, val_acc_hist
trial_hist1, val_acc_hist1 = NetRandomTuner2('hyperparmState2.csv',trials=70,epoch=100, saveAs='hyperparmState.csv')
Trial #70 Finished - Search Time 2.30 Mins
Total Time Elapsed: 935.70 Mins

Hyperparameters     |Trial Values: #70 |Best Trial Values: #9
Learning Rate       |0.040000          |0.014987
Weight Decay (L2)   |0.008000          |0.000619
Momentum            |0.951000          |0.960000
Highest Val Acc     |91.64             |95.42
Epoch (Highest Val) |15                |65

Next trial: [0.011130237608828497, 0.0017235477520255059, 0.951]

Trial ended at trial #70
About 150 trials were run in total, each with a maximum of 100 epochs and 15-epoch early stopping. Training took more than 20 hours in total. It could probably have been faster (perhaps 10+ hours) had I stuck to one fast machine instead of several different machines. My best trial was #79 of 150, so stopping the search much earlier would have saved time, although there is no way to know in advance when the tuner will reach its highest validation accuracy. The analysis of the second random search is shown below.
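To give a sense of how much of the search space a run like this covers, here is a minimal standard-library sketch. It assumes the default ranges from the `NetRandomTuner2` signature and a 70-trial budget; the helpers `logspace` and `linspace` are hypothetical stand-ins for the NumPy calls, and the seed is arbitrary.

```python
import itertools
import random

def logspace(start, stop, num):
    """Hypothetical pure-Python stand-in for np.logspace (base 10, endpoints included)."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]

def linspace(start, stop, num):
    """Hypothetical pure-Python stand-in for np.linspace (endpoints included)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]

# Same ranges as the tuner's default arguments
lr_range = [v / 1000 * 0.4 for v in logspace(1, 2, 10)]     # learning rates
wd_range = [v / 10000 * 0.08 for v in logspace(1, 3, 10)]   # weight decays
beta_range = linspace(0.924, 0.96, 5)                       # momentum values

# Full grid of combinations, shuffled so a prefix of it is a random sample
grid = [list(combo) for combo in itertools.product(lr_range, wd_range, beta_range)]
random.seed(0)  # arbitrary seed, only to make this sketch repeatable
random.shuffle(grid)

trials = 70  # assumed budget, matching the call above
sampled = grid[:trials]
print(f"grid size: {len(grid)}, sampled: {len(sampled)}")
print(f"coverage: {len(sampled) / len(grid):.0%}")
```

Because the grid is shuffled before it is truncated, each run only ever sees a random subset of the combinations, which is why the best trial can appear at any point in the search.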
def hyperparmsOverview(val_acc_hist, trial_hist, returnNum):
    df_top = pd.DataFrame([],columns=['Highest val acc','LR','Weight decay','Momentum','Epoch (of Highest Val Acc)'])
    df_btm = pd.DataFrame([],columns=['Highest val acc','LR','Weight decay','Momentum','Epoch (of Highest Val Acc)'])
    # Sort trials by validation accuracy and keep the best and worst returnNum
    top = sorted(zip(val_acc_hist, trial_hist), reverse=True)[:returnNum]
    btm = sorted(zip(val_acc_hist, trial_hist), reverse=False)[:returnNum]
    for i,e in zip(top,btm):
        df_top = pd.concat([df_top,pd.DataFrame([[i[0],i[1][0],i[1][1],i[1][2],i[1][3]]],columns=['Highest val acc','LR','Weight decay','Momentum','Epoch (of Highest Val Acc)'])])
        df_btm = pd.concat([df_btm,pd.DataFrame([[e[0],e[1][0],e[1][1],e[1][2],e[1][3]]],columns=['Highest val acc','LR','Weight decay','Momentum','Epoch (of Highest Val Acc)'])])
    print(f'Top {returnNum} Highest Val Accuracy Hyperparameters')
    display(df_top)
    print(f'\n\nBottom {returnNum} Highest Val Accuracy Hyperparameters')
    display(df_btm)
hyperparmsOverview(val_acc_hist1, trial_hist1,returnNum=5)
Top 5 Highest Val Accuracy Hyperparameters
| | Highest val acc | LR | Weight decay | Momentum | Epoch (of Highest Val Acc) |
|---|---|---|---|---|---|
| 0 | 95.42 | 0.014987 | 0.000619 | 0.960 | 65.0 |
| 0 | 95.37 | 0.010781 | 0.004796 | 0.951 | 44.0 |
| 0 | 95.36 | 0.023979 | 0.004796 | 0.924 | 70.0 |
| 0 | 95.32 | 0.023979 | 0.002875 | 0.933 | 82.0 |
| 0 | 95.31 | 0.010781 | 0.002875 | 0.942 | 49.0 |
Bottom 5 Highest Val Accuracy Hyperparameters
| | Highest val acc | LR | Weight decay | Momentum | Epoch (of Highest Val Acc) |
|---|---|---|---|---|---|
| 0 | 91.64 | 0.040000 | 0.008000 | 0.951 | 15 |
| 0 | 94.03 | 0.014375 | 0.004796 | 0.960 | 12 |
| 0 | 94.39 | 0.005166 | 0.000080 | 0.924 | 71.0 |
| 0 | 94.51 | 0.003000 | 0.000371 | 0.924 | 60.0 |
| 0 | 94.57 | 0.005004 | 0.000371 | 0.933 | 50.0 |
trainset2_rep = DataAugment(train_input,hFlip_RE,50)
trainset_rep = DataAugment(trainset2_rep,randRotate,50)
train_input = torch.cat((train_input,trainset_rep))
train_label = torch.cat((train_label,train_label))
display(train_input.shape)
del trainset2_rep
torch.Size([120000, 1, 28, 28])
trainloader = DataLoader(TensorDataset(train_input.type('torch.FloatTensor'),train_label.type('torch.FloatTensor')), shuffle=True, batch_size=512)
testloader = DataLoader(TensorDataset(testset.type('torch.FloatTensor'),test_label.type('torch.FloatTensor')), shuffle=False, batch_size=512)
Evaluation function:
def eval(model,criterion,test_loader,title):
    correct = 0
    total = 0
    print(title)
    y_pred = []
    y_true = []
    # Torch tensor has to be on 1 GPU device to concat, so I selected GPU:0
    cuda0 = torch.device('cuda:0')
    wrong_samples = torch.empty(0, device=cuda0)
    wrong_preds = torch.empty(0, device=cuda0)
    actual_preds = torch.empty(0, device=cuda0)
    running_loss = []
    model.eval()
    # No gradients are needed for evaluation, which saves memory and time
    with torch.no_grad():
        for i, data in enumerate(test_loader):
            inputs, labels = data
            inputs = inputs.to(device)
            labels = labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            y_pred.extend(outputs.argmax(-1).tolist())
            y_true.extend(labels.argmax(-1).tolist())
            correct += (outputs.argmax(-1) == labels.argmax(-1)).sum().item()
            running_loss.append(loss.item())
            total += labels.size(0)
            # Keep the misclassified samples for later inspection
            wrong_mask = outputs.argmax(-1) != labels.argmax(-1)
            wrong_samples = torch.cat((wrong_samples,inputs[wrong_mask]))
            wrong_preds = torch.cat((wrong_preds,outputs.argmax(-1)[wrong_mask]))
            actual_preds = torch.cat((actual_preds,labels.argmax(-1)[wrong_mask]))
    test_accuracy = (100 * correct / total)
    test_loss = sum(running_loss) / len(running_loss)
    model.train()
    print(classification_report(y_true, y_pred,target_names=class_labels))
    s = sns.heatmap(confusion_matrix(y_true, y_pred), cmap='Blues', annot=True, fmt='d',
                    xticklabels=class_labels,
                    yticklabels=class_labels)
    s.set_xlabel('Predicted', fontsize=16)
    s.set_ylabel('Actual', fontsize=16)
    plt.show()
    return test_loss, test_accuracy, wrong_samples, wrong_preds, actual_preds
criterion = nn.CrossEntropyLoss()
SimpleNet = SimpleNet2([2,2,2],1,32,2,512)
opt_base = optim.SGD
if torch.cuda.device_count() > 1:
    SimpleNet = nn.DataParallel(SimpleNet)
    SimpleNet = SimpleNet.to(device)
else:
    SimpleNet = SimpleNet.to(device)
# Best hyperparameters from the random search above
optimizer = SAM(SimpleNet.parameters(), opt_base, lr=0.014987, momentum=0.96, weight_decay=0.000619)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda1)
for epoch in range(70):
    for data in trainloader:
        img, label = data
        img = img.to(device)
        label = label.to(device)
        outputs = SimpleNet(img)
        loss = criterion(outputs, label) # use this loss for any training statistics
        loss.backward()
        optimizer.first_step(zero_grad=True)
        outputs = SimpleNet(img)
        # second forward-backward pass
        criterion(outputs, label).backward() # make sure to do a full forward pass
        optimizer.second_step(zero_grad=True)
    scheduler.step()
# Evaluation with test set - returns loss, accuracy and the misclassified samples
test_loss, test_accuracy, wrong_samples, wrong_preds, actual_preds = eval(SimpleNet,criterion,testloader,'Final Model')
print('\nFinal model test_accuracy (Top-1 accuracy): ',test_accuracy)
print('\nFinal model test_loss: ',test_loss)
Final Model
              precision    recall  f1-score   support

 T-shirt/top       0.90      0.92      0.91      1000
     Trouser       0.99      0.99      0.99      1000
    Pullover       0.93      0.94      0.93      1000
       Dress       0.95      0.94      0.95      1000
        Coat       0.91      0.94      0.93      1000
      Sandal       0.99      0.99      0.99      1000
       Shirt       0.86      0.83      0.85      1000
     Sneaker       0.97      0.98      0.98      1000
         Bag       0.99      0.99      0.99      1000
  Ankle boot       0.98      0.97      0.98      1000

    accuracy                           0.95     10000
   macro avg       0.95      0.95      0.95     10000
weighted avg       0.95      0.95      0.95     10000
Final model test_accuracy (Top-1 accuracy): 94.93

Final model test_loss: 0.18866165429353715
| Highest f1-scores | Class | Correct/Total |
|---|---|---|
| 0.99+ | Bag | 995/1000 |
| 0.99+ | Trouser | 990/1000 |
| 0.99+ | Sandal | 987/1000 |

| Lowest f1-scores | Class | Correct/Total |
|---|---|---|
| 0.85 | Shirt | 826/1000 |
| 0.91 | T-shirt/top | 915/1000 |
| 0.93 | Coat | 942/1000 |
The final top-1 accuracy of the model is 94.93%, about 0.5 percentage points below the best validation accuracy from hyperparameter tuning (95.42%). This is not surprising. Although I never used validation data for training, I repeatedly chose the hyperparameters and data augmentation that gave the highest validation accuracy, so the model selection itself has come close to overfitting the validation set.
Generally, classes with the highest f1-scores have more distinct features. For example, 'Bag' does not look like any other class. In contrast, classes with lower f1-scores such as 'Shirt' and 'T-shirt/top' look very similar to one another and could technically be identified as the same object.
random_idxs = np.random.choice(wrong_preds.shape[0], 20, replace=False)
fig, ax = plt.subplots(4, 5, figsize=(20, 20))
for idx, subplot in zip(random_idxs, ax.ravel()):
    pred = class_labels[wrong_preds.type('torch.LongTensor')[idx]]
    actual = class_labels[actual_preds.type('torch.LongTensor')[idx]]
    subplot.imshow(wrong_samples.cpu().numpy().reshape(wrong_samples.shape[0],28,28)[idx], cmap='gray')
    subplot.set_title(f"Label: {actual}, Predicted: {pred}")
    # Hide the axes of every subplot (plt.axis only affects the last one)
    subplot.axis("off")
Generally, 'T-shirt/top' is predicted as 'Shirt' and vice versa, and some of the mistaken predictions are understandable. For example, some of the clothes have features that overlap with 'Coat' (these features were discussed during EDA #1.2.1), since coats generally have long sleeves.
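The confusion pattern described above can also be counted directly rather than read off sample images. The following is a minimal sketch using only the standard library. The function name `most_confused_pairs` and the toy labels are hypothetical; in this notebook the inputs could plausibly be built from `actual_preds` and `wrong_preds` converted to integer lists, but that wiring is an assumption.

```python
from collections import Counter

def most_confused_pairs(y_true, y_pred, class_names, top_k=3):
    """Return the top_k (actual, predicted, count) triples among misclassified samples."""
    errors = Counter(
        (class_names[t], class_names[p])
        for t, p in zip(y_true, y_pred)
        if t != p  # only count wrong predictions
    )
    return [(actual, pred, n) for (actual, pred), n in errors.most_common(top_k)]

# Toy labels for illustration only; these are NOT the notebook's predictions.
names = ["T-shirt/top", "Shirt", "Coat"]
y_true = [0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 0, 1, 0, 2, 2, 1]

for actual, pred, n in most_confused_pairs(y_true, y_pred, names):
    print(f"{actual} -> {pred}: {n}")
```

Listing the pairs in both directions makes it easy to see whether a confusion is symmetric, as it appears to be for 'T-shirt/top' and 'Shirt'.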
Getting all the filters for our final model
layers = []
for parm in SimpleNet.parameters():
    layers.append(parm)
Visualizing first convolutional layer filter weights (32 filters)
from torchvision import utils
def visTensor(tensor, ch=0, allkernels=False, nrow=8, padding=1):
    n,c,w,h = tensor.shape
    if allkernels: tensor = tensor.view(n*c, -1, w, h)
    elif c != 3: tensor = tensor[:,ch,:,:].unsqueeze(dim=1)
    rows = np.min((tensor.shape[0] // nrow + 1, 64))
    # Arrange the kernels in a grid, normalized to [0, 1] for display
    grid = utils.make_grid(tensor, nrow=nrow, normalize=True, padding=padding)
    plt.figure( figsize=(nrow,rows) )
    plt.imshow(grid.numpy().transpose((1, 2, 0)))
filter = layers[0].data.cpu().clone()
visTensor(filter, ch=0, allkernels=False)
plt.axis('off')
plt.ioff()
plt.show()
<Figure size 800x500 with 0 Axes>
This looks rather incomprehensible. I will pass a sample image through the filters to form feature maps, label each map with its filter number, and try to visualise what type of feature each filter extracts.
Applying the weights of our layers to a functional conv2d with a sample image as our input
def featureMap(convWeights,image_input,plotX=8,plotY=4):
    cuda0 = torch.device('cuda:0')
    image_np = image_input.cpu().numpy()
    # Convolve the sample image with the layer's weights to get one map per filter
    image_map = F.conv2d(image_input.to(cuda0), convWeights.data.to(cuda0), padding=1)
    image_map = image_map.cpu()
    fig, ax = plt.subplots(plotY, plotX, figsize=(20, 20))
    for i, subplot in enumerate(ax.ravel()):
        image = image_map[i]
        subplot.imshow(image, cmap="gray")
        subplot.set_title(f"Filter {i+1}")
        subplot.axis("off")
    plt.show()
Original images of sample 0 & 1
plt.imshow(train_input_np[0], cmap='gray')
plt.title('Sample 0')
plt.show()
plt.imshow(train_input_np[1], cmap='gray')
plt.title('Sample 1')
plt.show()
I will only analyse the feature maps of 2 sample images (comparing all of them would take too long).
print('Shoe feature map')
featureMap(layers[0],train_input[0],8,4)
print('\nShirt feature map')
featureMap(layers[0],train_input[1],8,4)
Shoe feature map
Shirt feature map
Just from looking at these 2 feature maps I can roughly tell what each filter is picking up. For example, filters 1, 27 and 29 are inverted and capture the general shape of the image. Filters 10, 20, 26 and 28 respond more strongly (brighter) to the right edges of the object, while filters 2, 7, 13 and 18 respond more strongly to the left edges. A few filters seem redundant or too similar, which may suggest that a simpler model would suffice for this simple dataset. However, I cannot really judge the importance of each filter: a filter that looks insignificant or similar to another may still contribute a lot of activation for a particular class further down the neural network.
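One way to put a number on "redundant or too similar" is the pairwise cosine similarity between flattened filter weights. The sketch below is an illustration, not part of the original analysis. It uses random NumPy weights as a stand-in for the 32 first-layer filters and assumes 3x3 kernels; `filter_similarity` is a hypothetical helper, and in the notebook the input could plausibly be `layers[0].data.cpu().numpy()`.

```python
import numpy as np

def filter_similarity(weights):
    """Pairwise cosine similarity between conv filters.

    weights: array of shape (n_filters, in_channels, kh, kw).
    Returns an (n_filters, n_filters) matrix with values in [-1, 1].
    """
    flat = weights.reshape(weights.shape[0], -1).astype(np.float64)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.clip(norms, 1e-12, None)  # avoid division by zero
    return unit @ unit.T

rng = np.random.default_rng(0)
w = rng.normal(size=(32, 1, 3, 3))  # random stand-in, NOT the trained weights
w[5] = -2.0 * w[1]                  # plant an inverted, rescaled copy of filter 1

sim = filter_similarity(w)
np.fill_diagonal(sim, 0.0)          # ignore each filter's similarity with itself
i, j = np.unravel_index(np.abs(sim).argmax(), sim.shape)
print(f"most similar pair: filters {i} and {j}, cosine similarity {sim[i, j]:.3f}")
```

An absolute similarity close to 1 flags a pair as near-duplicates (positive) or inverted copies (negative). As noted above, that alone does not show that either filter is unimportant to later layers.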
The final model file will likely be removed before submission.
torch.save(SimpleNet,'PartA_FinalModel')
All in all, I was able to steadily improve my model along this Fashion MNIST CNN journey. I have learnt many essentials of working with CNNs, such as not being afraid to explore different architectures, modifying them, and looking for more efficient ways to improve my model. With this experience, I am well equipped to explore harder datasets such as CIFAR-100.
End of Part A